Citing Foreign Language Sources: an Analysis of the S2ORC Dataset

Marc Bertin1,∗,†, Iana Atanassova2,†

1 Université Claude Bernard Lyon 1, ELICO, 43 Boulevard du 11 novembre 1918, 69622 Villeurbanne cedex, France
2 Université de Franche-Comté, CRIT, 30 rue Mégevand, F-25000 Besançon, France; Institut Universitaire de France (IUF), France

Abstract

In this article we investigate the multilingualism of references in the Semantic Scholar Open Research Corpus (S2ORC). While this dataset contains peer-reviewed papers from different disciplines, written mainly in English, we identify the languages used in the references, their linguistic groups and their distribution over time. The results allow us to observe the dominance of English in science, as well as the relative proportions of the other 34 languages, which account for over 2.4 million of the cited sources, and of their linguistic groups. We show that the relative share of non-English citations has been increasing since 2000. However, the vast majority of citations in non-English publications are to English sources. We discuss some of the limitations of this study, mainly related to the nature of the dataset, which is biased towards English, and to the quality of the language detection tool.

Keywords: Foreign language sources, Language detection, English, S2ORC, Bibliographic References, Bibliometrics

1. Introduction

The English language has gradually gained a dominant position in science [1]. At the same time, the use of different languages in publications remains a subject of scientific debate, which also finds an echo at the political level. Kirillova showed in 2019 [2] that the publication of articles in English in a given country is related to its scientific activity as well as to the size of the country.
These results reflect the desire of large countries to maintain their national language as the language of science, while small non-English speaking countries aim to reach international standards.

BIR 2023: 13th International Workshop on Bibliometric-enhanced Information Retrieval at ECIR 2023, April 2, 2023
∗ Corresponding author.
† These authors contributed equally.
Email: marc.bertin@univ-lyon1.fr (M. Bertin); iana.atanassova@univ-fcomte.fr (I. Atanassova)
Web: https://elico-recherche.msh-lse.fr/membres/marc-bertin (M. Bertin); iana-atanassova.github.io/ (I. Atanassova)
ORCID: 0000-0003-1803-6952 (M. Bertin); 0000-0003-3571-4006 (I. Atanassova)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

In addition, Moskaleva and Akoev [3] showed that, outside the home country, publications in a non-English native language are less read and cited than those in English. They also observed that, for multilingual journals, the ranking of a journal is correlated with its share of English publications. In terms of content, Di Bitetti and Ferreras [4] compared the abstracts of articles in bilingual journals and found no differences in aspects such as quality or general interest between articles published in English and those published in other languages. The work of Smirnova and Lillis in 2022 [5] is based on a corpus comparing research articles written in Russian with those written in English in three disciplines: philosophy, sociology and economics. At the micro level, the article analyses the changes in citations in the English and Russian texts.
At the macro level, the article raises questions about what is considered "citation-worthy" in different geolinguistic contexts and considers the consequences of citation brokerage and of knowledge production and circulation practices on a global scale. The work of Angulo et al. in 2021 [6] shows that science based on a single language, particularly English, can be a barrier to knowledge transfer and can lead to bias in the provision of global models. Their experience shows that including non-English sources can reduce bias in understanding and enrich scientific knowledge. Faced with this hegemony, some publishers have begun to advocate bilingual journals [7].

In this article we address the problem of multilingualism in publications through a corpus-based experiment using the Semantic Scholar Open Research Corpus (S2ORC) [8]. Our objective is to examine foreign language (non-English) references in this corpus, and to observe their nature and distribution. S2ORC is a large dataset of full-text peer-reviewed research articles, mainly written in English. We chose this dataset for our experiment because it focuses on English research articles and spans many academic disciplines. S2ORC includes articles from a wide range of scientific fields including medicine, biology, chemistry, engineering, computer science, physics, materials science, mathematics, psychology, economics, political science, business, geology, sociology, geography, environmental science, art, history and philosophy. Other large datasets of research articles exist, such as the thematic datasets CORD-19 on COVID-19 [9] and Climate change [10], or the multilingual dataset ISTEX [11], which have different coverage.

When studying references to foreign language sources, we should take into account the problem of the multiplicity of scripts. Indeed, an article written in English may contain references in other languages that are expressed either in the native script of the foreign language or in Latin characters.
Traditionally, two types of operations have been used to romanise languages with non-Latin scripts. On the one hand, transliteration consists in replacing each grapheme of one writing system by a grapheme or group of graphemes of another system, regardless of pronunciation. On the other hand, transcription works from pronunciation rather than from spelling: each phoneme of a language is represented by a grapheme or group of graphemes of the target writing system. Transliteration is based on national and international standards¹.

¹ E.g. Armenian: ISO 9985:1996; Macedonian, Turkish, Russian, Ukrainian, Belarusian, Bulgarian, non-Slavic languages in Cyrillic, based on ISO 9:1995; Chinese with NF ISO 7098:1992; Georgian, using ISO 9984:1996; Greek with ISO 843:1997; Hebrew, Yiddish or Syriac, which is a Hebrew script based on the NF ISO 259-2:1995 standard; or Thai, which uses the ISO 11940:1998 standard.

2. Methods

2.1. Dataset

We use the full S2ORC version 1 dataset, which contains approximately 81 million open access articles published up to 2020. While the dataset is intended to include only English language articles, a closer look reveals that a small proportion of the articles are in other languages, as we show in the Results section. The dataset is available in JSON format. Each article is identified by its paper_id. For our experiment, we extracted the metadata of articles (title and year) and the metadata of the bibliographic references they contain (titles, years, journals, etc.).

2.2. Processing Pipeline

The language of each article and bibliographic reference was detected using Google's Compact Language Detector v3 (gcld3) Python library², which implements a neural network model for language identification. We used the title of the reference as input. gcld3 provides several outputs, including the code³ of the most likely language and the probability score for that language.
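The detection step described above can be sketched as follows. This is a minimal illustration and not the authors' actual code: the helper names (Detection, make_gcld3_identifier, detect_title_language) are hypothetical, and the min_num_bytes and max_num_bytes values are assumptions rather than settings reported in the paper. The 0.95 cut-off is the probability threshold that the paper applies to discard uncertain detections.

```python
from typing import Callable, NamedTuple, Optional


class Detection(NamedTuple):
    language: str       # BCP 47 tag, e.g. "en" or "ru-Latn"
    probability: float  # score reported for that tag


def make_gcld3_identifier() -> Callable[[str], Detection]:
    """Wrap gcld3 (installed separately) as a plain text -> Detection function."""
    import gcld3  # imported lazily so the rest of the sketch runs without it

    # Byte limits are assumed values, not taken from the paper.
    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

    def identify(text: str) -> Detection:
        result = detector.FindLanguage(text=text)
        return Detection(result.language, result.probability)

    return identify


def detect_title_language(
    title: Optional[str],
    identify: Callable[[str], Detection],
    min_probability: float = 0.95,
) -> Optional[str]:
    """Return the language tag of a title, or None when the title is empty
    or the detection score is below the threshold."""
    if not title or not title.strip():
        return None
    detection = identify(title)
    if detection.probability < min_probability:
        return None
    return detection.language
```

With gcld3 installed, one would call identify = make_gcld3_identifier() and then detect_title_language(title, identify) for each article or reference title. Passing the identifier as a parameter keeps the thresholding logic independent of the detection library, so it can be exercised with a stub.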
In order to obtain a good quality sample, we discarded references for which the probability score was lower than 0.95. This is the case, for example, when the title is too short to identify the language with certainty. References with missing metadata, e.g. a missing year, were also ignored (less than 1% of all references). Furthermore, papers and references with a year before 1950 or after 2020 were ignored, as such values are most likely due to errors in the dataset (less than 0.5% of all references).

The metadata for the citing paper (title and year) was obtained by using its paper_id in S2ORC. The language of the citing paper was determined in the same way as for the references, using its title.

The quality of the language detection varies and is lower for some low-resource languages. We chose the gcld3 library after testing several other language detection libraries, e.g. spacy lang-detect. We found that, for our task, gcld3 provides better results and is much faster. We manually evaluated language detection by examining a subset of 100 titles for each language. If the observed precision was below 50%, we excluded that language⁴ from our experiment. For the majority of the remaining languages, the precision is above 75%.

gcld3 recognises cases of Latin transliteration and assigns a language code followed by -Latn, e.g. ru-Latn stands for Russian text transliterated into Latin characters. In our experiment, we do not distinguish between references that are transliterated and those that are written in the native script.

² https://github.com/google/cld3
³ The language codes use the IETF BCP 47 language tags, which combine several standards: ISO 639, ISO 15924, ISO 3166-1 and UN M.49. The subtags are maintained by the IANA Language Subtag Registry.
⁴
This applies to the following languages: af, ar, az, be, bg-Latn, ceb, co, cy, eo, et, eu, fil, fy, ga, gd, gl, ha, haw, hi, ht, ig, jv, kk, ku, ky, lb, mg, mi, mk, mn, ms, mt, ne, ny, sm, sn, so, sq, st, su, sw, uz, vi, xh, yi, yo, zu.

2.3. Composition of the Linguistic Groups

In order to study how the different language groups are represented in the dataset, we classified the languages, with their language tags, into linguistic groups. The summary is presented in Table 1, which contains all language tags identified as present in the dataset. English is the only language not listed with its group (Indo-European Germanic): we treat it separately from the other languages because it is the dominant language of the corpus. Languages that were excluded from the experiment because of the poor quality of the language detection are presented in red.

Table 1: Linguistic Groups and Language Tags

Linguistic Group              IETF BCP 47 language tag
Afro-Asiatic Cushitic         so
Afro-Asiatic Semitic          ha, mt
Austroasiatic                 ceb, fil, haw, id, jv, mg, mi, ms, sm, su, tl, vi
Auxiliary (Esperanto)         eo
Basque                        eu
Central Semitic               ar, he, iw
Creole                        ht
English                       en
Indo-European Albanian        sq
Indo-European Baltic          lt, lv
Indo-European Celtic          cy, ga, gd
Indo-European Germanic        af, da, de, fy, is, lb, nl, no, sv, yi
Indo-European Hellenic        el, el-Latn, gr
Indo-European Indo-Iranian    bn, hi, ku, ne, tg
Indo-European Romance         ca, co, es, fr, gl, it, la, pt, ro
Indo-European Slavic          be, bg, bg-Latn, cs, cz, hr, mk, pl, ru, ru-Latn, sk, sl, sr, uk
Japanese                      ja, ja-Latn
Koreanic                      ko
Mongolian                     mn
Niger-Congo                   ig, ny, sn, st, sw, xh, yo, zu
Sino-Tibetan Sinitic          zh, zh-Latn
Thai                          th
Turkic                        az, kk, ky, tr, uz
Uralic Finno-Ugric            et, fi, hu
West Semitic                  am

3. Results

After running the language detection and applying the above criteria, we obtained a subset of S2ORC containing a total of 5.9 million articles with 109.8 million references for which the language was detected with a probability score above 0.95.

3.1.
Languages present in the corpus

Although the dataset is intended to contain only English research articles, we found 35,463 articles that were identified as being written in other languages, out of a total of 5,934,799 articles (0.60%). Figure 1 shows a bar chart of the number of articles found for each language.

Figure 1: Non-English articles in the S2ORC dataset

Most of these non-English articles are identified as Latin; this result may be biased because titles in the biomedical domain may be misclassified as Latin text. For the remaining languages, we manually sampled the dataset and checked that such articles are indeed present. In our study, we have taken this result as an opportunity to work with a subset of multilingual (non-English) research articles and study their references.

Overall, we found that the dataset contains articles in 32 languages (including English) and references to sources in 35 languages (including English). As might be expected, the vast majority of references in the dataset are to English sources. They account for 107,437,865 references out of a total of 109,838,938 references (97.81%). The remaining 2,401,073 references are to non-English sources. Figure 2 shows a bar chart of the number of references to sources in the different languages.

To gain a better understanding of the types of foreign language sources cited, we extracted titles of non-English sources cited in English articles and translated some of them for analysis and discussion. The translation is intended to be informative. The results are shown in Table 2. Languages that were excluded from the experiment because of the poor quality of the language detection are presented in red. We can see that the titles correspond to studies that have local significance and dimension, relate to a particular country or geographical area, and are written in the local language. For example, the title in haw (Hawaiian) represents a study on the history of "Kahalu’u and Keauhou".
The example in sq (Albanian) is a study of hydrocarbon energy in Albania, and the example in mk (Macedonian) is about geological studies in the villages of Kosel and Pesočani.

Figure 2: References to non-English sources in the S2ORC dataset

Table 2: Examples of titles for non-English references

haw
  Title: He Mo’olelo ’Äina: Kahalu’uKaulana i ka Wai Puka iki o Helani a me Keauhou-I ka ’Ili’ili Nehe: A History of Kahalu’u and Keauhou Ahupua’a District of Kona, Island of Hawai’i. Kumu Pono Associates LLC
  Transl.: A History of Kahalu’u and Keauhou Ahupua’a District of Kona, Island of Hawai’i. Teacher Right Associates LLC
hi-Latn
  Title: Dolpa Jillama Yarsagumba Sankalan Tatha Byebasthapan Ek Parichaye
  Transl.: Yarsagumba collection and settlement an introduction in Dolpa district
is
  Title: Sjávarnytjar við Ísland. Mál og menning
  Transl.: Seafood off Iceland. Language and culture
lt
  Title: Lietuvos nekilnojamojo turto rinka: nekilnojamojo turto ir statybos sąnaudų kainų analizė
  Transl.: Lithuanian real estate market: analysis of real estate and construction cost prices
lv
  Title: Latviešu tēlotāja māksla 1860-1940. Rīga: Zinātne
  Transl.: Latvian painter’s art 1860-1940. Riga: Science
mi
  Title: Te Poha o Tohu Raumati Te Rūnanga o Kaikōura Environmental Management Plan. Te Rūnanga o Kaikōura, Takahanga Marae Kaikōura
  Transl.: Te Poha o Tohu Rāmati Te Rūnanga of Kaikōura Environmental Management Plan. The Kaikōura Cabinet, Kaikōura Field Events
mk
  Title: Геолошки извештај за сулфатните појави на селата Косел, Песочани. Вапила и Влгоште. Стручен фонд на Геолошки завод -Скопје
  Transl.: Geological report on sulfate occurrences in the villages of Kosel, Pesočani. Vapila and Vlgoshte. Professional fund of Geological Institute - Skopje
ro
  Title: Zece mii de culturi, o singură civilizaţie. Spre geomodernitatea secolului XXI
  Transl.: Ten thousand cultures, one civilization. Towards the geomodernity of the 21st century
ru-Latn
  Title: Nauchno-tekhnologicheskie prioritety dlya modernizatsii rossiyskoy ekonomiki S&T Priorities for Modernization of Russian Economy
  Transl.: Scientific and technological priorities for the modernization of the Russian economy S&T Prioritize for Modernization of Russian Economy
sk
  Title: Vecná a časová zmena motivácie riadiacich zamestnancov v Slovenských elektrárňach a. s. Mochovce. On-line odborný časopis Manažment v teórii a praxi
  Transl.: Material and temporal change in the motivation of management employees in Slovenské elektrárňa a. with. Mochovce. On-line professional magazine Management in theory and practice
sq
  Title: Konsumi i energjisë së hidrokarbureve në Shqipëri dhe në Botë në vitet
  Transl.: Energy consumption of hydrocarbons in Albania and in the world in years
sw
  Title: Anayedhulimiwa asipopambana na huyo dhalimu, yeye ataendelea kuteseka wakati dhalimu atastarehe kwa amani’, An-Nuur
  Transl.: If the oppressed does not fight the oppressor, he will continue to suffer while the oppressor rests in peace’, An-Nuur
tg
  Title: Садриддин Айнӣ ва баъзе масъалаҳои инкишофи забони адабии тоҷик
  Transl.: Sadriddin Ainy and some issues of Tajik literary language development
uk
  Title: Напрями формування інноваційної системи нового технологічного укладу в Україні
  Transl.: Directions of the formation of the innovative system of the new technological order in Ukraine
uk
  Title: Стан корупції в Україні. Порівняльний аналіз загальнонаціональних досліджень
  Transl.: The state of corruption in Ukraine. Comparative analysis of national studies
vi
  Title: Những cây thuốc và vị thuốc Việt Nam, Nhà xuất bản Khoa học Kỹ thuật
  Transl.: Vietnamese medicinal plants and herbs, Science and Technology Publishing House
zh-Latn
  Title: Jiating yinsu dui weichengnianren fanzui de yingxiang ji duice shizhengyanjiu (Empirical study on the impacts of family factor on juvenile delinquency and solution: A case study in Chongqing)
  Transl.: An Empirical Study on the Influence of JIA Listening Factors on Juvenile Delinquency and Countermeasures

3.2. Evolution over time

We studied the distribution of citations to non-English sources with respect to the year of the reference and with respect to the year of publication of the citing article. Figure 3 shows the relative proportion of non-English sources cited by articles published between 1950 and 2020. The numbers on the horizontal axis represent the average number of sources cited per article in our dataset. The nine most frequent languages are shown in different colours, and the last group contains all other languages. We can see that between 1960 and 1980, the non-English references cited were mostly in French and German. Since 2000, however, the share of these two languages has been decreasing, while a large number of cited sources in Portuguese and Spanish have appeared. Overall, the relative share of non-English references has increased since 2000.

3.3. Use of English in relation with other languages

To study how foreign language sources are cited in English articles, we looked at the subset of references from English articles to non-English sources. Figure 4 shows the relative proportions of different language groups cited in English articles. The left-hand side of the figure shows the linguistic groups of the citing articles, while the right-hand side shows the linguistic groups of the references. The majority of references are from the Indo-European Romance and Germanic groups, which are the closest languages to English. The Indo-European Slavic languages come next, and all the other groups have very small proportions of citations.

Finally, we examined the subset of references from non-English articles. The results are shown in Figure 5. Again, the left side of the figure represents the language groups of the citing articles, while the right side represents the language groups of the references.
It is interesting to note the dominance of English in this group, as we see that the majority of citations in these articles are to English sources. In addition, articles written in all other language groups cite a majority of English sources. Nevertheless, a significant proportion of publications in Indo-European Romance languages cite sources in the same language group, and the same is true for the other Indo-European groups.

Figure 3: Relative share of non-English sources with respect to the year of publication

Figure 4: English articles citing non-English sources

Figure 5: References in non-English articles

4. Discussion and Limitations

The results presented in this study may be biased by several factors. Firstly, the quality of language detection varies between languages, particularly for low-resource languages, which may have a low recognition rate. Manual cleaning and evaluation would be required to produce better quality data.

Secondly, the use of titles as proxies to detect the language of a publication is not optimal, as titles can be very short or contain technical terms that may wrongly point to Latin or English. However, given the size of the dataset and the data available for references, this was the only way to identify the languages of the sources. Another possibility would be to use titles and abstracts. This study could therefore be improved by harvesting abstracts for the references in the dataset.

The choice of the S2ORC dataset introduces an important bias related to the English language. As S2ORC is intended to include only English language research articles, the subset of non-English language articles that we analysed may not be representative of these languages, as it is made up only of publications that were nevertheless included in S2ORC. The S2ORC dataset was constructed by retaining only papers identified as English using the cld2 tool run over titles and abstracts [8].
This introduces a bias towards English and other languages that use the Latin alphabet. We can suppose that such a tool is better at excluding non-English articles written in non-Latin scripts than at excluding those written in Latin script. This may lead to an over-representation of Indo-European Romance languages. Furthermore, the extraction pipeline used for S2ORC is optimised for the English language, and the extraction of citations to foreign language papers was not evaluated when the dataset was created. Thus, it can be expected that the extracted references are biased towards English language references.

In general, the choice of sources to be cited in a publication and their languages is a complex process influenced by many factors, such as the languages spoken by co-authors, the subject of the study, and so on. The editorial requirements of some journals may favour English language references or require the titles of foreign language references to be translated into the language of the publication.

5. Conclusion and Perspectives

We carried out an analysis of the Semantic Scholar Open Research Corpus (S2ORC) and identified the languages of the research articles and their references based on their titles. While the vast majority are in English, we found articles in 44 different languages and references in 54 languages. We observed their linguistic groups and their distribution over time between 1950 and 2020. The results allow us to observe the dominance of English in science, where the vast majority of citations in non-English publications are to English sources. We also show that the relative share of non-English citations has been increasing since 2000. Evaluating and improving the quality of language detection can provide us with more reliable results in the future.
In the long run, this work aims to propose a typology of foreign language references in order to better understand the way they are used in publications. Indeed, one of the issues that interests bibliometricians today is the context of citation. To our knowledge, the classification of citation contexts according to multilingual criteria has not been addressed in the literature. To this end, an extension of this approach will focus on national and other multilingual corpora. It also seems relevant to study such references in terms of their place in publications and their linguistic contexts.

Acknowledgments

This work was supported by grant number ANR-20-CE38-0003-01 and grant number ANR-21-CE38-0003-01.

References

[1] P. S. Rao, The role of english as a global language, Research Journal of English 4 (2019) 65–79.
[2] O. V. Kirillova, Publication language and the journal scientometric indicators in global citation databases, Science Editor and Publisher 4 (2019) 21–33. doi:10.24069/2542-0267-2019-1-2-21-33.
[3] O. Moskaleva, M. Akoev, Non-english language publications in citation indexes - quantity and quality, CoRR abs/1907.06499 (2019). URL: http://arxiv.org/abs/1907.06499.
[4] M. S. Di Bitetti, J. A. Ferreras, Publish (in english) or perish: The effect on citation rate of using languages other than english in scientific publications, Ambio 46 (2017) 121–127. doi:10.1007/s13280-016-0820-7.
[5] N. Smirnova, T. Lillis, Citation in global academic knowledge making: A paired text history methodology for studying citation practices in english and russian, Journal of English for Research Publication Purposes 3 (2022) 78–108.
[6] E. Angulo, C. Diagne, L. Ballesteros-Mejia, T. Adamjy, D. A. Ahmed, E. Akulov, A. K. Banerjee, C. Capinha, C. A. Dia, G. Dobigny, V. G. Duboscq-Carra, M. Golivets, P. J. Haubrock, G. Heringer, N. Kirichenko, M. Kourantidou, C. Liu, M. A. Nuñez, D. Renault, D. Roiz, A. Taheri, L. N. Verbrugge, Y. Watari, W. Xiong, F. Courchamp, Non-english languages enrich scientific knowledge: The example of economic costs of biological invasions, Science of The Total Environment 775 (2021) 144441. doi:10.1016/j.scitotenv.2020.144441.
[7] M. D. Rosselli, Moving towards english, Acta Neurológica Colombiana 36 (2020) 1–2. doi:10.22379/24224022270.
[8] K. Lo, L. L. Wang, M. Neumann, R. Kinney, D. Weld, S2ORC: The semantic scholar open research corpus, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4969–4983. doi:10.18653/v1/2020.acl-main.447.
[9] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y. Katsis, R. M. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. M. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research dataset, in: Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, Association for Computational Linguistics, Online, 2020. URL: https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1.
[10] R. Grundmann, R. Krishnamurthy, The discourse of climate change: A corpus-based approach, Critical approaches to discourse analysis across disciplines 4 (2010) 125–146.
[11] P. Cuxac, A. Collignon, ISTEX, un projet national d’archives documentaires : au-delà de l’accès au texte intégral, l’enrichissement des données par méthodes de fouille de textes, in: Analyser la science : les bibliothèques numériques comme objet de recherche in 85ème Congrès ACFAS, Montréal, Canada, 2017. URL: https://hal.science/hal-01869036.