BIR 2020 Workshop on Bibliometric-enhanced Information Retrieval Intrication between Information Retrieval and Bibliometrics: the case of scientific domain delineation Michel Zitt Lereco Lab (retired), National Institute for Agronomic Research (INRA), Dept SAE2, Nantes, France mzitt@numericable.fr Abstract. This invited paper summarizes a recent synthesis [8] on the topic of scientific domain delineation. See also [5]. Keywords: Scientific Domain Delineation, Information Retrieval, Bib- liometrics. For decades, Bibliometrics and Information Retrieval developed an increas- ingly close relationship. Information retrieval (namely its application to science and technology) first emphasized data organization, indexing and query systems. Bibliometrics, driven by exploration of science networks (Price) or evaluation of research systems, was boosted by Garfield’s “citation index”, a co-word which symbolizes the match between IR and bibliometrics. Kessler’s work was a beacon of their opening to mapping techniques. Bibliometric investigation, whether for cognitive or for evaluative purposes, typically targets particular “domains” defined at some scale. The meso-level scope covers sub-disciplines, fields, large research areas; the frontier with the micro-level of research fronts or topics is quite fuzzy. Field delineation is usu- ally understood in terms of publication sets. Delineation requirements may be quite basic or more demanding, depending on the study’s ambitions and the type of domain. High-level requirements are associated to large projects on “dif- ficult” fields such as emerging, transversal or complex areas. Operationalization of delineation typically exploits three models, on their own or in combination. All of them rely on quantitative methods in statistics-probability, data analysis (clustering and factor techniques) and graph-network theory. Model A covers institutional classifications and nomenclatures that stem from the cooperation between scientists, S&T policy experts, librarians and in- formation scientists. Examples: aggregate classification from a few national and international bodies; national or international patent office classifications; clas- sifications from producers of disciplinary or general databases — with various Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIR 2020, 14 April 2020, Lisbon, Portugal. 129 BIR 2020 Workshop on Bibliometric-enhanced Information Retrieval degrees of disaggregation. A distinction should be made between institutional, national and international schemes of classification. Schemes that are linked to national or institutional evaluation, especially, are politically loaded and even “international” standards may be biased towards the points of view of stakehold- ers. In this respect, the way in which disciplines are grouped or broken down bears witness to social and national stakes at a given period. Given a bibliomet- ric project, predefined categories sometimes happen to meet the requirements, although this is not generally the case on transversal or emerging areas. Model B, in contrast with this sedimented knowledge, relies on ad-hoc IR investigations in a particular query system. In some cases, this appears as a re- finement brought to existing groupings. For example, whereas some databases offer categories based on journals lists, variants may be established using some expertise — the same goes for patent categories. It is quite common to delineate relatively simple domains in this way, keeping in mind the limitations of Brad- fordian ranked lists. A better compromise requires lexical query (or citation) facilities. A domain of science or technology is seldom captured by global terms so that it becomes necessary to combine restricted queries, using the various adaptive techniques of up-to-date IR, in order to reach reasonable precision and recall. Model C is based on bibliometric mapping and clustering, which makes “invisible colleges” visible through a variety of bibliometric networks (actors, texts, citations, etc.). A domain is then expected to emerge as a delimited area on a map or graph at the proper scale. In practice, the mental expectations of users seldom match a single deus ex machina mapping exercise, which inevitably conveys implicit points of view and technical artefacts. Variable adaptive com- binations of top-down phases (cutting the domain out of an extended map of science) and/or bottom-up ones (aggregation of themes in low-level maps) will help. Mapping techniques exploit developments in clustering and community de- tection (graph unfolding: Louvain, Infomap, SLMA. . . ; spectral methods: LSA, pLSA, LDA. . . ) as well as classical clustering and factor analysis. The modes of supervision (by commissioners, bibliometricians, scientists from the domain. . . ) are crucial to validate the findings at various stages. Model A mostly uses “frozen” IR knowledge, which does not spare discussion on its per- tinence in the particular case. Model B queries require experts’ mental represen- tation of the field, if only to avoid missing significant subareas. Model C has to deal with peculiarities of the mapping methods. Border areas, particularly, need supervision. Multi-networks. Scientific IR as well as bibliometrics bring into play the various networks that are explicitly or implicitly created by published S&T: ac- tors, papers, texts, citations, classification categories, sometimes funding infor- mation, etc. Citation certainly remains the iconic network of bibliometrics, while IR tradition mostly exploited nomenclature-based indexes and lexical terms. The Internet era shuffled the cards, with the rise of webometrics. Quasi-citations (URL linkages) migrated from journal bibliometrics to Google; PageRank then irrigated webometrics and bibliometrics. The cognitive theory of scientific infor- Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIR 2020, 14 April 2020, Lisbon, Portugal. 130 BIR 2020 Workshop on Bibliometric-enhanced Information Retrieval mation [2] relies on a multinetwork universe. The associated poly-representation of literature offers to combine or confront approaches using different networks — a spontaneous practice in pragmatic bibliometrics — in order to address a wide classe of issues on a particular dataset. Thus, many producers of scientometric indicators rely on the default SCI1 or Scopus classes, or ad-hoc lists of journals, for a rough delimitation. Pragmatic mixes improve the process: lexical or nomenclature index queries; supervised lists of prominent authors, or most cited ones, in multistage and adaptive protocols. For example, a high-precision detection of a core literature may be completed by various recall-oriented extensions (query-expansion, network proximity, etc.) with the possible help of science maps. Protocols involving multi-network design are fruitful. Two networks are generally held as the most precise for thematic analysis: lexical terms (words/phrases from controlled or natural language) and paper-level citations; the author networks are also valuable. Citations and words exhibit analogies and also differences: granularity level, statistical properties, unification issues especially in natural language, dynamic capability — more direct for citation. Citation biases are severe in an evaluation context but some- what less in a mapping context. Lexical analysis is prone to natural language traps, homonymy, synonymy, metaphors, etc. There is huge literature on the properties of the two universes. Sequential word-citation protocols give a few examples of how themes delimi- tation and mapping take advantage of those properties. Among those mentioned in the aforementioned synthesis, the sequence that starts with careful core build- ing, using supervised lexical queries, and that goes on with a quasi-automatic citation-based extension, proved efficient ([6] and related works). Alternatively, one may first compare techniques, by exploring the citation- way and the lexical way in parallel, and then compare and possibly combine the results. We explore this at a cluster delineation level, namely the detection of areas within a given domain (beforehand delimited by hybrid techniques: nanoscience, genomics. . . ). We established that word and citation techniques shape fairly convergent breakdowns but without coincidence, a phenomenon well characterized in cross-maps [7]. Hence, the two approaches are akin but not substitutable, as each one carries its own infometric and sociological point of view. The “full hybrid” approaches, where citation and text tokens are mixed, differ from the mildly hybrid ones. Even in naive form, hybrid bibliographic coupling outperformed simple b.c. in Boyack and Klavans studies [3]. Flexible integrated approaches account for differences in statistical distributions [1]. Full hybridiza- tion privileges the information science perspective and deliberately overviews the sociological dimension of the two networks. Most classical techniques rely on citation keys and, in the lexical realm, on word n-grams or more sophisticated linguistic analysis. Instead, featureless representation of information such as character n-grams or bit sequences, proved efficient in our experience. Related n-gram or compression distances are used. 1 SCI, WoS, WoK product from ISI, Thomson, Thomson-Reuters, now Clarivate. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIR 2020, 14 April 2020, Lisbon, Portugal. 131 BIR 2020 Workshop on Bibliometric-enhanced Information Retrieval The cost is the black-box effect, losing any direct semantic interpretation. This can be operated field by field (e.g. actors/ all text/ references) or for the whole document. In the latter case, the representation is savagely hybrid. Delineation ultimately aims at defining a domain in terms of documents, but this objective may be met through indirect paths (rather than inter-articles coupling) and probabilistic model. Citation-in-context studies, following Small’s investigations ([4] and many others) changed the granularity of hybrid think- ing in bibliometrics. The vizualisation of citation contexts in online engines is now classical. Narrowing the context of particular citations also has various ap- plications in bibliometric studies: section affiliation of references (introduction, methods, findings. . . ); a step further, association or mutual labelling of words and reference within linguistic units (sentences, paragraphs. . . ). From this fine granularity, hybridization techniques expect a strong gain in precision, whatever the aim, historiography, classification, etc. Prior decomposition of an article in smaller lexical units might enhance the convergence of cocitation and coword clustering in the afore-mentioned cross-mapping exercise. As previously stated, the frontier with micro-level small topic delineation is fuzzy. Meso-scale bibliometric studies, which involve a breakdown into many topics, cannot generally afford detailed supervision at this micro-level, where themes are detected by some automatic data analysis. However, using cross- maps or hybrid designs, may be helpful to look for robust “strong forms” or at least cores in multinetworks, resisting to change of settings and vantage points. Supervision is nevertheless necessary at the key stages, and costly. For both validity, acceptability and involvement of actors, the careful dimensioning and conduct of supervision is a key factor in large studies. References An extensive bibliography is found in [8]. Below, just a few indications on our works, and a few classical milestones. [1] Boyack, K.W., Klavans, R.: Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology 61(12), 2389–2404 (2010), doi:10.1002/asi.21419 [2] Ingwersen, P.: Cognitive perspectives of information retrieval interaction: Elements of a cognitive IR theory. Journal of Documentation 52(1), 3–50 (1996), doi:10.1108/eb026960 [3] Janssens, F., Glänzel, W., De Moor, B.: Dynamic hybrid clustering of bioin- formatics by incorporating text mining and citation analysis. In: Berkhin, P., Caruana, R., Wu, X., Gaffney, S. (eds.) KDD’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 360–369. Association for Computing Machinery (2007), doi:10.1145/1281192.1281233 [4] Small, H.: Co-citation context analayses and the structure of paradigms. Journal of Documentation 36(3), 183–196 (1980), doi:10.1108/eb026695 Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIR 2020, 14 April 2020, Lisbon, Portugal. 132 BIR 2020 Workshop on Bibliometric-enhanced Information Retrieval [5] Zitt, M.: Meso-level retrieval: IR-bibliometrics interplay and hybrid citation- words methods in scientific fields delineation. Scientometrics 102(3), 2223– 2245 (2015), doi:10.1007/s11192-014-1482-5 [6] Zitt, M., Bassecoulard, E.: Delineating complex scientific fields by an hybrid lexical-citation method: An application to nanosciences. Information Process- ing & Management 42(6), 1513–1531 (2006), doi:10.1016/j.ipm.2006.03.016 [7] Zitt, M., Lelu, A., Bassecoulard, E.: Hybrid citation-word representations in science mapping: Portolan charts of research fields? Journal of the Amer- ican Society for Information Science and Technology 62(1), 19–39 (2011), doi:10.1002/asi.21440 [8] Zitt, M., Lelu, A., Cadot, M., Cabanac, G.: Bibliometric delineation of sci- entific fields. In: Glänzel, W., Moed, H.F., Schmoch, U., Thelwall, M. (eds.) Springer Handbook of Science and Technology Indicators, chap. 2, pp. 25–68. Springer, Berlin (2019), doi:10.1007/978-3-030-02511-3 2 Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIR 2020, 14 April 2020, Lisbon, Portugal. 133