=Paper=
{{Paper
|id=Vol-3745/paper9
|storemode=property
|title=Revealing the Country-level Preference on Research Methods in the Field of Digital Humanities: From the Perspective of Library and Information Science
|pdfUrl=https://ceur-ws.org/Vol-3745/paper9.pdf
|volume=Vol-3745
|authors=Chengxi Yan,Zhichao Fang
|dblpUrl=https://dblp.org/rec/conf/eeke/YanF24
}}
==Revealing the Country-level Preference on Research Methods in the Field of Digital Humanities: From the Perspective of Library and Information Science==
Chengxi Yan 1,∗, Zhichao Fang 2

1 School of Information Resource Management, Renmin University of China, Beijing, China, 100872

2 Digital Humanities Research Center, Renmin University of China, Beijing, China, 100872

===Abstract===
Research methods (RMs) are an important element of both individual scientific research and national technological development, especially in interdisciplinary fields such as digital humanities (DH), which is closely related to library and information science (LIS). Considering the scarcity of relevant training data, this study proposes a multi-stage recognition algorithm that combines a large language model with an iterative learning strategy to automatically extract method mentions from DH scientific documents. Following the taxonomy of RMs in existing LIS research, we used dictionary-based mapping to transform these entities into RMs and their types. To clarify the differences in RM preferences across countries, we identified the countries of each paper and established the relationship between them and the RMs. A clustering model was used to detect country-level RM preference. The experiments showed that quantitative research, especially experimental methods, has played an increasingly central role in the international DH field. RM preference is also distributed distinctively among different countries.

Keywords: Bibliometric analysis, Research methods, Entity recognition, Digital humanities

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23~24, 2024, Changchun, China and Online.

∗ Corresponding author. ORCID: 0000-0003-1128-550X (C. Yan); 0000-0002-3802-2227 (Z. Fang). Email: 20218113@ruc.edu.cn (C. Yan); fzc0225@163.com (Z. Fang).

© Copyright 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

===1. Introduction===
For the majority of scientific researchers, identifying and understanding the research methods (RMs) of different scientific fields is not only a necessary basic academic skill, but also a significant reference for grasping the whole picture of a field's development or solving domain problems [1]. The Stanford Encyclopedia of Philosophy defines RM as “the means of how the aims and products of science are achieved, which should be distinguished from meta-methodology and the detailed and contextual practices” [2].

The distinct characteristics of scientific approaches, technical standards and application norms can be reflected in the different use of RMs across countries. Therefore, a comparative analysis of how RM preferences vary between countries is conducive to a more systematic and efficient evaluation of national scientific strength and innovative ability. Moreover, it promotes country-level awareness of strengths and weaknesses in both international academic collaboration and competition. With the rapid development of entitymetrics-based approaches [3], the identification and measurement of RMs has become a hot research issue, especially for interdisciplinary fields that integrate a large number of different technologies and methods, such as rule-based and deep neural network-based methods. However, accurately identifying all the different types of RMs remains highly challenging, because training corpora annotated with RM-related entities are limited for supervised models and unsupervised approaches yield low prediction performance.

In addition, most previous studies analyzed the usage frequency of RMs in the field of library and information science (LIS), which ignored hot interdisciplinary fields related to LIS, such as digital humanities (DH), and the difficulty of classifying their RMs. As a research area that is inherently methodological and heavily indebted to LIS [4], DH is often viewed as a “big tent” [5] that includes different disciplines with an extensive range of RMs. Considering the interdisciplinary nature of DH and its close relationship with LIS, this study adopted DH as the object of analysis. Accordingly, three research questions (RQs) are proposed:

RQ1: From a global perspective, does DH research tend to be qualitative, quantitative, or mixed, and which countries are typical representatives of these three method types?

RQ2: What are the differences in the preference of RMs among different countries?

RQ3: Is there a certain pattern for the country-level preference of RMs?

===2. Related Work===
The approaches to automatic recognition of RM entities can be divided into two main stages, namely rule-based [6-8] and machine learning-based technology. For example, Zha et al. adopted abbreviation patterns and regular expressions to extract candidate algorithmic entities [6]. Considering the weak recognition performance of such rules, more researchers turned to machine learning approaches [9-13]. In Zhang et al.'s study [9], software entities from the PLOS ONE full texts were identified and reorganized into five different groups using a clustering algorithm. Wang et al. constructed a term function identification model based on deep learning (DL) [10].

The classification of RMs can be traced back to early studies based on content analysis in the LIS field, such as Jarvelin and Vakkari's systematic categorization [14]. Hider [15], Kumpulainen [16], and other LIS scientists who adopted and optimized Jarvelin's classification theory of RMs further reported on the use of RMs in the long-term evolution of the LIS field. One of the most influential studies was done by Chu and Ke [17], in which three representative LIS journals were coded, computed and analyzed, yielding 16 RMs. This classification scheme has promoted a variety of developments for RMs, such as the influence analysis of algorithmic entities [7], the exploration of the dynamic evolution of RMs in the Chinese LIS field [18], and the survey of RMs in practice projects [19].

===3. Research Design===
To answer the RQs, we proposed a research analytical framework, shown in Figure 1, which includes three main steps. The latter two steps are the most crucial components.

Figure 1: The entire research framework.

====3.1 Construction of research dataset====
To obtain the original scientific DH papers, similar to previous studies [20, 21], we used a subject term-based query strategy (covering titles, abstracts and keywords), namely (“digital humanit*” OR “humanit* comput*” OR “ehumanit*” OR “electronic* humanit*” OR “e-humanit*”), in three well-known databases (Web of Science Database, Crossref Database and Dimensions Database, DD) to retrieve as many relevant documents as possible. The publication timespan was set between 1900 and 2021. Comparing the results, we found that DD covered almost all the records from the other two databases (mainly journal articles) and, more importantly, had a wide range of source types such as books, proceedings or preprint papers, and monographs. Thus, DD was selected as the source for dataset acquisition. The initial dataset contained a total of 4398 articles. Next, we deduplicated it and deleted irrelevant document records, finally resulting in 3469 papers.

To identify the country names in each paper, we utilized a large global database called “GRID (Global Research Identifier Database)”, which is one of the most popular open repositories of authoritative research institutions. We used GRID as an institutional dictionary to link the institutions where the authors (in DD records) are located to their corresponding countries. The processed dataset consists of 1915 papers.

====3.2 Automatic extraction of method entities====
Given the linguistic complexity (e.g. contextual features of method entities) of DH documents, which have both humanities and technological aspects, the identification of RMs may face greater technical difficulties. We proposed a three-stage method for automatic entity extraction. Firstly, we constructed prompt-based templates for a large language model (GPT-3.5) to perform zero-shot learning, which generated coarse-grained annotation results for method entities. Secondly, a vocabulary containing standard method terminologies and their variations (e.g. abbreviations, synonyms) was built through manual collection and multiple rounds of expert evaluation. Thirdly, inspired by Gupta and Manning's work [8], we designed an iterative learning process to identify and correct method entities in the resulting dataset in a human-in-the-loop way, during which the rule-based transformation and classification of RMs were performed. During RM conversion, each method entity was automatically “translated” to a regular RM in Chu and Ke's taxonomy [17] if a rule was matched.

Considering the patterns of RMs centrally observed and induced in the field of DH, we divided them into three broader categories in the light of Jarvelin and Vakkari's research [14], namely qualitative research, quantitative research, and mixed research. For instance, qualitative research includes “content analysis”, “ethnography and field study”, “historical method” and “interview”, among others, while representative quantitative research includes “experiment”, “think aloud protocol”, “transaction log analysis” and “bibliometrics”. A study was judged to be “mixed” only when it used at least one RM of each type.

====3.3 National preference of RMs====
For each DH record, we used regular expressions to match all authors and their institutions. A simple program was then designed to map them to the relevant countries based on the organizational names in the GRID database. To handle the multi-path relationship between records and countries, we followed the counting rule in [22]: a country is counted as using a method once when the relevant RM is mentioned in a paper, regardless of how many of the paper's authors are affiliated with that country. The cumulative counts of RM usage for a country were ultimately defined as its preference of RMs. Moreover, we used an agglomerative hierarchical clustering algorithm [23] to distinguish different country-level preference patterns.

===4. Result Analysis===
Among the three types of RM, the quantitative approach is the most mainstream one, accounting for 82.29% of the records in the dataset. Mixed approaches (9.93%) turn out to be slightly more common in DH research than the qualitative approach. Considering the increasing growth of DH papers [21], it is believed that quantitative analysis is becoming a more and more important research means.

Specifically, the dominant position of Western countries in DH studies is indisputable (see Table 1). The United States, which uses RMs most frequently, holds the clearest lead over other countries. The United Kingdom and Germany are in second place, especially Germany, which has a clear position of leadership in the qualitative type. The third-rank group comprises China, the Netherlands, Canada, Spain, and Australia. It is worth noting that, as one of the few Asian countries on the list (besides Singapore and Israel), China's outstanding performance in mixed and quantitative research is quite impressive, possibly due to its diversified use of RMs in the field of DH.

{| class="wikitable"
|+ Table 1: The ranking of total number of RM types used by the top 5 countries. Note: US, UK, NL, GER and SGP are short for United States, United Kingdom, Netherlands, Germany and Singapore, respectively.
! Rank !! Mixed !! Qualitative !! Quantitative
|-
| 1 || US (24) || US (29) || US (166)
|-
| 2 || UK (10) || GER (7) || GER (106)
|-
| 3 || China (9) || UK (7) || UK (86)
|-
| 4 || NL (7) || Australia (5) || China (48)
|}

Figure 2: The statistics of national preference of RMs.

As a sign of the quantitative approach, “experiment”-based approaches are the most frequently utilized (see Figure 2). This can be seen in the visual analysis of RM usage, where the relative usage rate of this RM reaches 76.98%-94.94%. Even if the sample is expanded to all countries with a usage frequency greater than 10, the rate still exceeds 60%. We therefore temporarily exclude the RM of “experiment” from the following comparison. According to Figure 3, theoretical approaches, as the most important qualitative methods, stand out from the remaining RMs. They are frequently used by most developed countries in Europe and America, such as the United States and the United Kingdom. Chinese DH scholars seem to show a keener interest in bibliometric methods, while the Americans prefer theoretical approaches. Both countries pay considerable attention to “interview”. By contrast, “observation”, “transaction log analysis”, “research diary or journal”, and “focus group” are not highly valued by the countries mentioned. Other lower-ranked countries use only one or two RMs, notably Finland, which shows the most intensive preference for theoretical approaches. Relatively speaking, the choice of RMs is more evenly distributed for Canada, indicating that Canadian scholars may be more open to qualitative methods.

Figure 3: Preference choices of RMs measured by usage frequency (up) or usage ratio (down) in representative countries.

The clustering results of the RM preference are shown in Figure 4. There are three clusters in the entire field of DH, namely #1 (United States), #2 (Germany, China, and United Kingdom), and #3 (Other Countries).

Figure 4: Different clusters of countries based on RM-related preference.

For the four RMs “Experiment”, “Theoretical approach”, “Others”, and “Interviews” (i.e., ETOI approaches), the clustering-based grouping above gives a clear, macroscopic picture of the different national preference levels of RMs. Cluster #1 is the group with a strong preference for ETOI approaches, while #2 and #3 are the medium-level and weak-level preference groups, respectively. Furthermore, the three groups differ in some features. Content analysis is heavily weighted in #1. Members of #2 are more likely to use bibliometric analysis in DH scholarship. By contrast, #3 focuses on observational methods. The difference is not only related to the overall performance of each cluster, but is also greatly influenced by individual members whose preference polarity is quite overpowering, such as China's preference (in #2) for bibliometric methods.

===5. Conclusion===
In this paper, we proposed an optimized iterative learning approach for RM extraction that combines large language models and rule-based transformation to extract and classify RMs from a constructed DH dataset. We compared the differences in RM preference among countries, which revealed distinctive country-level preference patterns in DH. As a preliminary study, our findings can provide certain guidance and assistance for further improving the level of DH development.

===Acknowledgements===
This project is supported by the grant from National Natural Science Foundation of China (NO. 72204258).
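The two counting rules described in Sections 3.2 and 3.3 (a study is “mixed” only when it uses at least one RM of each type; a country is credited with an RM at most once per paper) can be illustrated with a minimal sketch. This is not the authors' program: the function names `paper_type` and `country_preferences`, the record layout, and the two small method sets (taken only from the examples named in Section 3.2, not from the full taxonomy) are assumptions made for illustration.

```python
from collections import Counter

# Assumption: only the example RMs named in Section 3.2, not the full taxonomy.
QUALITATIVE = {"content analysis", "ethnography and field study",
               "historical method", "interview"}
QUANTITATIVE = {"experiment", "think aloud protocol",
                "transaction log analysis", "bibliometrics"}


def paper_type(methods):
    """Classify a paper as qualitative, quantitative, mixed, or unknown."""
    has_qual = any(m in QUALITATIVE for m in methods)
    has_quant = any(m in QUANTITATIVE for m in methods)
    if has_qual and has_quant:
        return "mixed"  # at least one RM of each type
    if has_qual:
        return "qualitative"
    if has_quant:
        return "quantitative"
    return "unknown"


def country_preferences(papers):
    """Accumulate RM usage per country.

    Each paper is a dict (hypothetical layout) with "countries", holding one
    entry per author affiliation, and "methods", holding the mapped RMs.
    A country is credited with an RM once per paper, no matter how many of
    the paper's authors belong to it.
    """
    prefs = {}
    for paper in papers:
        for country in set(paper["countries"]):  # deduplicate per paper
            prefs.setdefault(country, Counter()).update(set(paper["methods"]))
    return prefs


papers = [
    {"countries": ["US", "US", "China"], "methods": ["experiment", "interview"]},
    {"countries": ["US"], "methods": ["experiment"]},
]
prefs = country_preferences(papers)
```

In this toy input the first paper has two US authors, yet it adds only one “experiment” count for the US; the cumulative counters in `prefs` are the kind of per-country usage vectors that a clustering step could then compare.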
===References===
[1] Zhang, H., Zhang, C. (2021). Using Full-text Content of Academic Articles to Build a Methodology Taxonomy of Information Science in China. Knowledge Organization, 48(2), 126-139.

[2] Hepburn, B., Andersen, H. (2021). Scientific Method. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Summer 2021). Metaphysics Research Lab, Stanford University.

[3] Ding, Y., Song, M., Han, J., et al. (2013). Entitymetrics: Measuring the impact of entities. PloS One, 8(8), e71416.

[4] Poole, A. H. (2017). The conceptual ecology of digital humanities. Journal of Documentation, 73(1), 91-122.

[5] Jockers, M., Worthey, G. (2011). Introduction: welcome to the big tent. Proceeding of Digital Humanities 2011 Conference, 6-7.

[6] Zha, H., Chen, W., Li, K., Yan, X. (2019). Mining Algorithm Roadmap in Scientific Publications. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1083-1092.

[7] Wang, Y., Zhang, C. (2020). Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing. Journal of Informetrics, 14(4), 1-21.

[8] Gupta, S., Manning, C. D. (2011). Analyzing the dynamics of research by extracting key aspects of scientific papers. Proceedings of 5th International Joint Conference on Natural Language Processing, 1-9.

[9] Zhang, H., Ma, S., Zhang, C. (2019). Using Full-text of Academic Articles to Find Software Clusters. Proceedings of ISSI, 2776-2777.

[10] Wang, J., Cheng, Q., Lu, W., et al. (2023). A term function-aware keyword citation network method for science mapping analysis. Information Processing & Management, 60(4), 103405.

[11] Zhang, H., Zhang, C., Wang, Y. (2024). Revealing the technology development of natural language processing: A Scientific entity-centric perspective. Information Processing & Management, 61(1), 103574.

[12] Zhang, C., Tian, L., Chu, H. (2023). Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021. Information Processing & Management, 60(6), 103507.

[13] Boland, K., Krüger, F. (2019). Distant supervision for silver label generation of software mentions in social scientific publications. Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries, 15-27.

[14] Jarvelin, K., Vakkari, P. (1990). Content analysis of research articles in library and information science. Library & Information Science Research, 12, 395-421.

[15] Hider, P., Pymm, B. (2008). Empirical research methods reported in high-profile LIS journal literature. Library & Information Science Research, 30, 108-114.

[16] Kumpulainen, K. (1991). Library and information science research in 1975. Libri, 41(1), 59-76.

[17] Chu, H., Ke, Q. (2017). Research methods: What's in the name? Library & Information Science Research, 39(4), 284-294.

[18] Lou, W., Su, Z., He, J., et al. (2021). A temporally dynamic examination of research method usage in the Chinese library and information science community. Information Processing & Management, 58(5), 102686.

[19] Lund, B. D., Wang, T. (2021). An analysis of research methods utilized in five top, practitioner-oriented LIS journals from 1980 to 2019. Journal of Documentation, 77(5), 1196-1208.

[20] Tang, M. C., Cheng, Y. J., Chen, K. H. (2017). A longitudinal study of intellectual cohesion in digital humanities using bibliometric analyses. Scientometrics, 113(2), 985-1008.

[21] Su, F., Zhang, Y. (2021). Research output, intellectual structures and contributors of digital humanities research: a longitudinal analysis 2005-2020. Journal of Documentation, 78(3), 673-695.

[22] Sidone, O. J. G., Haddad, E. A., Mena-Chalco, J. P. (2017). Scholarly publication and collaboration in Brazil: The role of geography. Journal of the Association for Information Science and Technology, 68(1), 243-258.

[23] Sasirekha, K., Baby, P. (2013). Agglomerative hierarchical clustering algorithm-a. International Journal of Scientific and Research Publications, 83(3), 83.