Science in China. Knowledge Organization

Revealing the Country-level Preference on Research Methods in the Field of Digital Humanities: From the Perspective of Library and Information Science

Chengxi Yan

Zhichao Fang

Digital Humanities Research Center, Renmin University of China, Beijing, China, 100872

School of Information Resource Management, Renmin University of China, Beijing, China, 100872

2024

Research methods (RMs) are an important element of both individual scientific research and national technological development, especially in interdisciplinary fields such as digital humanities (DH), which is closely related to library and information science (LIS). Given the scarcity of relevant training data, this study proposes a multi-stage recognition algorithm that combines a large language model with an iterative learning strategy to automatically extract method mentions from DH scientific documents. Following the taxonomy of RMs in existing LIS research, we used dictionary-based mapping to transform these entities into RMs and their types. To clarify how RM preferences differ across countries, we identified the countries associated with each paper and linked them to the RMs used. A clustering model was then applied to detect country-level RM preferences. The experiments showed that quantitative research, especially experimental methods, has played an increasingly central role in the international DH field, and that RM preferences are distributed distinctively across countries.

1. Introduction

For most scientific researchers, identifying and understanding the research methods (RMs) used in different scientific fields is not only a basic academic skill but also an important reference for grasping the overall development of a field and for solving domain problems [1]. The Stanford Encyclopedia of Philosophy defines RM as “the means of how the aims and products of science are achieved, which should be distinguished from meta-methodology and the detailed and contextual practices” [2].

The distinct characteristics of scientific approaches, technical standards and application norms are reflected in how RMs are used in different countries. A comparative analysis of how RM preferences vary between countries is therefore conducive to a more systematic and efficient evaluation of national scientific strength and innovative ability. It also raises country-level awareness of strengths and weaknesses in international academic collaboration and competition. With the rapid development of entitymetrics-based approaches [3], the identification and measurement of RMs has become a prominent research topic, especially for interdisciplinary fields that integrate a large number of different technologies and methods, such as rule-based and deep neural network-based methods. However, accurately identifying all types of RMs remains highly challenging, because corpora annotated with RM-related entities are too limited to train supervised models and unsupervised approaches achieve low prediction performance.

In addition, most previous studies analyzed the usage frequency of RMs in the field of library and information science (LIS), overlooking active interdisciplinary fields related to LIS, such as digital humanities (DH), and the difficulty of classifying their RMs. As a research area that is inherently methodological and heavily indebted to LIS [4], DH is often viewed as a “big tent” [5] that covers different disciplines with an extensive range of RMs. Given the interdisciplinary nature of DH and its close relationship with LIS, this study takes DH as its object of analysis and proposes three research questions (RQs):

RQ1: From a global perspective, does DH research tend to be qualitative, quantitative, or mixed, and which countries are typical representatives of these three method types?

RQ2: How do RM preferences differ among countries?

RQ3: Is there a discernible pattern in country-level RM preferences?

2. Related Work

Approaches to the automatic recognition of RM entities fall into two main groups, namely rule-based [6-8] and machine learning-based technologies. For example, Zha adopted abbreviation patterns and regular expressions to extract candidate algorithmic entities [6]. Given the weak recognition performance of rule-based approaches, more researchers turned to machine learning [9-13]. In Zhang et al.’s study [9], software entities from the PLOS ONE full texts were identified and reorganized into five different groups using a clustering algorithm. Wang et al. constructed a term function identification model based on deep learning (DL) [10].

The classification of RMs can be traced back to early content-analysis studies in the LIS field, such as Jarvelin et al.’s systematic categorization [14]. Hider [15], Kumpulainen [16], and other LIS scientists who adopted and refined Jarvelin’s classification of RMs further reported on the use of RMs in the long-term evolution of the LIS field. One of the most influential studies was conducted by Chu and Ke [17], in which three representative LIS journals were coded, computed, and analyzed, yielding 16 RMs. This classification scheme has supported a variety of subsequent RM studies, such as the influence analysis of algorithmic entities [7], the exploration of the dynamic evolution of RMs in the Chinese LIS field [18], and the survey of RMs in practice projects [19].

3. Research Design

To answer the RQs, we propose an analytical framework consisting of three main steps, as shown in Figure 1. The latter two steps are the most crucial components.

3.1 Construction of research dataset

To obtain the original scientific DH papers, we followed previous studies [20, 21] and used a subject term-based query strategy (covering titles, abstracts and keywords), namely (“digital humanit*” OR “humanit* comput*” OR “ehumanit*” OR “electronic* humanit*” OR “e-humanit*”), in three well-known databases (Web of Science, Crossref, and Dimensions, hereafter DD) to retrieve as many relevant documents as possible. The publication timespan was set between 1900 and 2021. Comparing the results, we found that DD covered almost all the records from the other two databases (mainly journal articles) and, more importantly, included a wider range of source types such as books, proceedings or preprint papers, and monographs. DD was therefore selected as the source of the dataset. The initial dataset contained 4398 articles. After deduplicating the records and deleting irrelevant ones, 3469 papers remained.
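The retrieval and deduplication step can be approximated locally as follows. This is a minimal sketch, not the databases' own search syntax: it assumes each record is a dictionary with hypothetical `title`, `abstract`, `keywords` and `doi` string fields, treats `*` as "any run of word characters", and deduplicates on DOI (falling back to the normalized title), which the paper does not specify.

```python
import re

# Wildcard terms quoted in the query; '*' is read as "any run of word characters".
QUERY_TERMS = [
    "digital humanit*",
    "humanit* comput*",
    "ehumanit*",
    "electronic* humanit*",
    "e-humanit*",
]

def term_to_regex(term: str) -> str:
    """Convert one wildcard phrase into a regular-expression fragment."""
    parts = []
    for word in term.split():
        stem = re.escape(word.rstrip("*"))
        parts.append(stem + (r"\w*" if word.endswith("*") else ""))
    return r"\b" + r"\s+".join(parts)

PATTERN = re.compile("|".join(term_to_regex(t) for t in QUERY_TERMS), re.IGNORECASE)

def matches_query(record: dict) -> bool:
    """True if the title, abstract or keywords contain any query term."""
    text = " ".join(record.get(field, "") for field in ("title", "abstract", "keywords"))
    return bool(PATTERN.search(text))

def deduplicate(records: list) -> list:
    """Keep the first record per DOI, or per normalized title when no DOI exists."""
    seen, kept = set(), []
    for record in records:
        key = record.get("doi", "").lower() or " ".join(record.get("title", "").lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

Filtering with `matches_query` before calling `deduplicate` mirrors the order described above; the manual removal of irrelevant records is not modeled here.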

To identify the countries associated with each paper, we used the Global Research Identifier Database (GRID), a large global database and one of the most popular open repositories of authoritative research institutions. We used GRID as an institutional dictionary to link the institutions with which the authors (in DD records) are affiliated to their corresponding countries. The processed dataset consists of 1915 papers.
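The dictionary lookup described here might look like the sketch below. The `GRID_LOOKUP` table is a tiny hypothetical stand-in for the real GRID data, and exact matching on a lowercased, whitespace-normalized name is an assumption; the paper does not describe its matching rules.

```python
# Hypothetical stand-in for the GRID table: institution name -> country.
GRID_LOOKUP = {
    "renmin university of china": "China",
    "stanford university": "United States",
    "university of helsinki": "Finland",
}

def normalize_name(name: str) -> str:
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())

def countries_for_paper(affiliations: list) -> set:
    """Return the set of countries resolved from a paper's affiliations.

    Affiliations that are not found in the lookup table are skipped.
    """
    found = set()
    for affiliation in affiliations:
        country = GRID_LOOKUP.get(normalize_name(affiliation))
        if country is not None:
            found.add(country)
    return found
```

Returning a set means a country appears once per paper no matter how many of its institutions are listed, which matches the counting scheme used later for RM preference.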

3.2 Automatic extraction of method entities

Given the linguistic complexity of DH documents (e.g. the contextual features of method entities), which have both humanities and technological aspects, identifying RMs in them is technically more difficult. We proposed a three-stage method for automatic entity extraction. First, we constructed prompt-based templates for a large language model (GPT-3.5) to perform zero-shot learning, which generated coarse-grained annotations of method entities. Second, a vocabulary containing standard method terms and their variations (e.g. abbreviations, synonyms) was built through manual collection and multiple rounds of expert evaluation. Third, inspired by Gupta and Manning’s work [8], we designed an iterative learning process to identify and correct method entities in the resulting dataset in a human-in-the-loop way, during which rule-based transformation and classification of RMs were performed. In the RM conversion process, each method entity was automatically “translated” into a standard RM in Chu and Ke’s taxonomy [17] whenever a rule matched.
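The rule-based "translation" of method mentions into taxonomy RMs can be illustrated with a dictionary lookup. The sketch below is an assumption-laden simplification: the vocabulary entries, the normalization rules and the `iterate_once` helper are all hypothetical, and the LLM annotation stage is not modeled. Mentions that match no rule are returned separately so that a human reviewer can add rules in the next round.

```python
import re

# Hypothetical vocabulary: surface variant -> RM label in the target taxonomy.
VARIANT_TO_RM = {
    "interview": "interview",
    "semi-structured interview": "interview",
    "citation analysis": "bibliometrics",
    "bibliometric analysis": "bibliometrics",
    "controlled experiment": "experiment",
}

def normalize(mention: str) -> str:
    """Lowercase, drop punctuation other than hyphens, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s-]", " ", mention.lower())
    return " ".join(cleaned.split())

def map_mentions(mentions, vocabulary):
    """Split mentions into mapped RM labels and unmatched mentions."""
    mapped, unmatched = [], []
    for mention in mentions:
        key = normalize(mention)
        if key in vocabulary:
            mapped.append(vocabulary[key])
        else:
            unmatched.append(mention)
    return mapped, unmatched

def iterate_once(mentions, vocabulary, reviewer_rules):
    """One human-in-the-loop round: merge reviewer-approved rules, then remap."""
    updated = dict(vocabulary)
    updated.update({normalize(k): v for k, v in reviewer_rules.items()})
    return updated, map_mentions(mentions, updated)
```

Repeating `iterate_once` until the unmatched list stops shrinking is one plausible reading of the iterative process; the stopping criterion used in the study is not stated.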

To observe and summarize the overall pattern of RMs in the field of DH, we grouped them into three broader categories in the light of Jarvelin et al.’s research [14], namely qualitative research, quantitative research, and mixed research. For instance, qualitative research includes “content analysis”, “ethnography and field study”, “historical method”, and “interview”, among others, while representative quantitative RMs are “experiment”, “think aloud protocol”, “transaction log analysis”, and “bibliometrics”. A study was judged to be “mixed” only when it used at least one RM of each type.
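The three-way labeling rule can be written down directly. In this sketch the two sets contain only the example RMs named above, not the full taxonomy, and the function name and the `None` return value for papers with no recognized RM are assumptions.

```python
# Example RMs named in the text; the full taxonomy contains more entries.
QUALITATIVE = {
    "content analysis",
    "ethnography and field study",
    "historical method",
    "interview",
}
QUANTITATIVE = {
    "experiment",
    "think aloud protocol",
    "transaction log analysis",
    "bibliometrics",
}

def study_type(rms):
    """Label a paper from the RMs it uses.

    Returns "mixed" only when at least one RM of each type is present,
    and None when no RM belongs to either set.
    """
    used = set(rms)
    has_qualitative = bool(used & QUALITATIVE)
    has_quantitative = bool(used & QUANTITATIVE)
    if has_qualitative and has_quantitative:
        return "mixed"
    if has_qualitative:
        return "qualitative"
    if has_quantitative:
        return "quantitative"
    return None
```

For example, a paper using both “interview” and “experiment” is labeled “mixed”, whereas one using “experiment” and “bibliometrics” stays “quantitative”.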

3.3 National preference of RMs

For each DH record, we used regular expressions to match all authors and their institutions. A simple program then mapped them to the relevant countries based on the organizational names in the GRID database. Because a record can be linked to the same country through several authors, we followed the counting scheme of [22]: a country is counted as using a method once whenever the relevant RM is mentioned in a paper, regardless of how many of the paper’s authors belong to that country. The cumulative counts of RM usage for a country were ultimately defined as its RM preference. Moreover, we used an agglomerative hierarchical clustering algorithm [23] to distinguish different country-level preference patterns.
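The counting rule and the clustering step can be sketched together in plain Python. Several choices here are assumptions rather than details from the paper: papers are given as hypothetical `(countries, rms)` pairs, distances are Euclidean on raw counts, and clusters are merged by average linkage. The study's actual distance measure, linkage and any normalization are not stated.

```python
import math
from collections import Counter, defaultdict

def count_rm_usage(papers):
    """Count, per country, the number of papers in which each RM appears.

    papers: iterable of (countries, rms) pairs. A country and an RM are
    each counted at most once per paper, however often they recur.
    """
    usage = defaultdict(Counter)
    for countries, rms in papers:
        for country in set(countries):
            for rm in set(rms):
                usage[country][rm] += 1
    return usage

def to_vectors(usage, rm_order):
    """Turn the counts into one fixed-order vector per country."""
    return {country: [usage[country][rm] for rm in rm_order] for country in usage}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters groups."""
    clusters = [[name] for name in sorted(vectors)]

    def linkage(c1, c2):
        distances = [euclidean(vectors[a], vectors[b]) for a in c1 for b in c2]
        return sum(distances) / len(distances)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Because raw counts are used, a country with far more papers than the rest tends to end up in a cluster of its own; normalizing each vector to proportions would instead compare the shape of the preference profile.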

4. Result Analysis

Among the three types of RM, the quantitative approach is the mainstream, accounting for 82.29% of the records in the dataset. Mixed approaches (9.93%) turn out to be slightly more common in DH research than the qualitative approach. Given the continuing growth in the number of DH papers [21], quantitative analysis appears to be becoming an increasingly important research means.

Specifically, the dominant position of Western countries in DH studies is indisputable (see Table 1). The United States, which uses RMs most frequently, holds the clearest lead over other countries. The United Kingdom and Germany are in second place; Germany in particular holds a clear leading position in the qualitative type. The third tier comprises China, the Netherlands, Canada, Spain, and Australia. It is worth noting that, as one of the few Asian countries on the list (alongside Singapore and Israel), China performs impressively in mixed and quantitative research, possibly owing to its diversified use of RMs in the field of DH.

As the hallmark of the quantitative approach, “experiment”-based approaches are the most frequently used RMs (see Figure 2). The visual analysis of RM usage shows this clearly: their relative usage rates range from 76.98% to 94.94%. Even when the sample is expanded to all countries with a usage frequency greater than 10, the rate still exceeds 60%. Because this single RM would dominate any comparison, we temporarily exclude “experiment” from the analysis that follows.

According to Figure 3, theoretical approaches, as the most important qualitative methods, stand out from the remaining RMs. They are frequently used by most developed countries in Europe and America, such as the United States and the United Kingdom. Chinese DH scholars seem to show keener interest in bibliometric methods, while American scholars prefer theoretical approaches. Both countries pay considerable attention to “interview”. By contrast, “observation”, “transaction log analysis”, “research diary or journal”, and “focus group” are not highly valued by the countries mentioned. Other, lower-ranked countries use only one or two RMs; Finland in particular stands out for the most intensive preference for theoretical approaches. The choice of RMs is relatively evenly distributed for Canada, suggesting that Canadian scholars may be more inclusive towards qualitative methods.

The clustering results for RM preference are shown in Figure 4. There are three clusters in the entire field of DH, namely #1 (United States), #2 (Germany, China, and United Kingdom), and #3 (Other Countries).

For the four RMs “Experiment”, “Theoretical approach”, “Others”, and “Interviews” (i.e., the ETOI approaches), the machine learning-based grouping above gives a clear macroscopic picture of the different national levels of RM preference. #1 is the group with a strong preference for ETOI approaches, while #2 and #3 are the medium-level and weak-level preference groups, respectively. The three groups also differ in other features. Content analysis is heavily weighted in #1, #2 is more likely to use bibliometric analysis in DH scholarship, and #3 focuses on observational methods. These differences reflect not only the overall profile of each cluster but also the influence of individual members with an exceptionally strong preference, such as China’s preference (in #2) for bibliometric methods.

5. Conclusion

In this paper, we proposed an optimized iterative learning approach that combines large language models with rule-based transformation to extract and classify RMs from a constructed DH dataset. We compared how the use of RMs differs across countries, which revealed distinctive country-level preference patterns in DH. As a preliminary study, our findings can offer some guidance for the further development of DH.

Acknowledgements

This project is supported by a grant from the National Natural Science Foundation of China (No. 72204258).