=Paper= {{Paper |id=Vol-3745/paper9 |storemode=property |title=Revealing the Country-level Preference on Research Methods in the Field of Digital Humanities: From the Perspective of Library and Information Science |pdfUrl=https://ceur-ws.org/Vol-3745/paper9.pdf |volume=Vol-3745 |authors=Chengxi Yan,Zhichao Fang |dblpUrl=https://dblp.org/rec/conf/eeke/YanF24 }} ==Revealing the Country-level Preference on Research Methods in the Field of Digital Humanities: From the Perspective of Library and Information Science== https://ceur-ws.org/Vol-3745/paper9.pdf
                         Revealing the Country-level Preference on Research Methods in
                         the Field of Digital Humanities: From the Perspective of Library
                         and Information Science
                          Chengxi Yan 1,*, Zhichao Fang 2

                         1 School of Information Resource Management, Renmin University of China, Beijing, China, 100872
                         2 Digital Humanities Research Center, Renmin University of China, Beijing, China, 100872



Abstract
Research methods (RMs) are an important element of both individual scientific research and national technological development, especially in interdisciplinary fields such as digital humanities (DH), which is closely related to library and information science (LIS). Given the scarcity of relevant training data, this study proposes a multi-stage recognition algorithm that combines a large language model with an iterative learning strategy to automatically extract method mentions from DH scientific documents. Following the taxonomy of RMs in existing LIS research, we used dictionary-based mapping to transform these entities into RMs and their types. To clarify the differences in RM preferences across countries, we identified the countries associated with each paper and linked them to the RMs. A clustering model was then used to detect country-level RM preferences. The experiments showed that quantitative research, especially experimental methods, has played an increasingly central role in the international DH field. In addition, RM preferences are distributed distinctively across countries.
Keywords
Bibliometric analysis, Research methods, Entity recognition, Digital humanities

1. Introduction

For most scientific researchers, identifying and understanding the research methods (RMs) used in different scientific fields is not only a basic academic skill but also an important reference for grasping the overall development of a field and for solving domain problems [1]. The Stanford Encyclopedia of Philosophy defines RM as “the means of how the aims and products of science are achieved, which should be distinguished from meta-methodology and the detailed and contextual practices” [2].

The distinct characteristics of scientific approaches, technical standards and application norms are reflected in the different uses of RMs across countries. Therefore, a comparative analysis of the variation in RM preferences between countries is conducive to a more systematic and efficient evaluation of national scientific strength and innovative ability. Moreover, it promotes country-level awareness of strengths and weaknesses in both international academic collaboration and competition. With the rapid development of entitymetrics-based approaches [3], the identification and measurement of RMs has become a hot research issue, especially for interdisciplinary fields that integrate a large number of different technologies and methods, such as rule-based and deep neural network-based methods. However, accurately identifying all types of RMs remains highly challenging, owing to the limited training corpora annotated with RM-related entities for supervised models and the low prediction performance of unsupervised approaches.
                         Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities
                         from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024),
                         April 23~24, 2024, Changchun, China and Online. ∗ Corresponding author.
                            0000-0003-1128-550X (C. Yan); 0000-0002-3802-2227 (Z. Fang);
                         EMAIL: 20218113@ruc.edu.cn (C. Yan); fzc0225@163.com (Z. Fang)
                                        © Copyright 2024 for this paper by its authors. Use permitted under
                                        Creative Commons License Attribution 4.0 International (CC BY 4.0).




CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
In addition, most previous studies analyzed the usage frequency of RMs within library and information science (LIS) itself, overlooking popular interdisciplinary fields related to LIS, such as digital humanities (DH), and the difficulty of classifying their RMs. As a research area that is inherently methodological and heavily indebted to LIS [4], DH is often viewed as a “big tent” [5] that brings together different disciplines with an extensive range of RMs. Considering the interdisciplinary nature of DH and its close relationship with LIS, this study adopts DH as its object of analysis. Accordingly, three research questions (RQs) are proposed:

RQ1: From a global perspective, does DH research tend to be qualitative, quantitative, or mixed, and which countries are typical representatives of these three method types?
RQ2: What are the differences in RM preferences among countries?
RQ3: Is there a certain pattern in country-level RM preferences?

2. Related Work

Approaches to the automatic recognition of RM entities fall into two main stages, namely rule-based [6-8] and machine learning-based technology. For example, Zha et al. used abbreviation patterns and regular expressions to extract candidate algorithmic entities [6]. Given the weak recognition performance of rule-based approaches, more researchers turned to machine learning [9-13]. In Zhang et al.'s study [9], software entities from PLOS ONE full texts were identified and organized into five groups using a clustering algorithm. Wang et al. constructed a term function identification model based on deep learning (DL) [10].

The classification of RMs can be traced back to early content-analysis studies in the LIS field, such as Jarvelin and Vakkari's systematic categorization [14]. Hider and Pymm [15], Kumpulainen [16], and other LIS scholars adopted and refined this classification of RMs and further reported on the use of RMs over the long-term evolution of the LIS field. One of the most influential studies was done by Chu and Ke [17], in which three representative LIS journals were coded, computed and analyzed, yielding 16 RMs. This classification scheme has prompted a variety of further work on RMs, such as the influence analysis of algorithmic entities [7], the exploration of the dynamic evolution of RMs in the Chinese LIS field [18], and a survey of RMs in practitioner-oriented LIS journals [19].

3. Research Design

To answer the RQs, we propose an analytical framework with three main steps, as shown in Figure 1. The latter two steps are the most crucial components.

Figure 1: The entire research framework.

3.1 Construction of research dataset

To obtain the original scientific DH papers, we followed previous studies [20, 21] and used a subject term-based query (covering titles, abstracts and keywords): (“digital humanit*” OR “humanit* comput*” OR “ehumanit*” OR “electronic* humanit*” OR “e-humanit*”). We ran it in three well-known databases (Web of Science, Crossref and Dimensions, hereafter DD) to retrieve as many relevant documents as possible. The publication timespan was set to 1900-2021. Comparing the results, we found that DD covered almost all the records (mainly journal articles) from the other two databases and, more importantly, offered a wide range of source types such as books, proceedings, preprints and monographs. Thus, DD was selected as the source of the dataset. The initial dataset contained 4398 articles. After deduplication and removal of irrelevant records, 3469 papers remained.

To identify the countries associated with each paper, we used GRID (Global Research Identifier Database), one of the most popular open repositories of authoritative research institutions. We used GRID as an institutional dictionary to link the institutions with which the authors (in DD records) are affiliated to their corresponding countries. The processed dataset consists of 1915 papers.
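
The GRID-based linking step of Section 3.1 can be pictured as a dictionary lookup from normalized institution names to countries. The sketch below is only an illustration under stated assumptions: the three-entry `GRID_SAMPLE` table, the `normalize` helper, the function names and the record layout are hypothetical and are neither the authors' program nor the actual GRID schema.

```python
import re

# Hypothetical miniature stand-in for a GRID-style institution table.
GRID_SAMPLE = {
    "renmin university of china": "China",
    "stanford university": "United States",
    "university of oxford": "United Kingdom",
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def countries_for_paper(affiliations, grid=GRID_SAMPLE):
    """Return the set of countries whose institutions occur in the affiliations."""
    found = set()
    for aff in affiliations:
        norm = normalize(aff)
        for institution, country in grid.items():
            if institution in norm:
                found.add(country)
    return found


def keep_linked(records, grid=GRID_SAMPLE):
    """Keep only records for which at least one country could be resolved."""
    linked = {}
    for paper_id, affiliations in records.items():
        countries = countries_for_paper(affiliations, grid)
        if countries:
            linked[paper_id] = countries
    return linked
```

In this sketch, records whose affiliations match no dictionary entry are dropped. That is one plausible reading of why the processed dataset is smaller than the deduplicated one, although the paper does not state the exact filtering criteria.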
3.2 Automatic extraction of method entities

Given the linguistic complexity (e.g. contextual features of method entities) of DH documents, which have both humanities and technological aspects, identifying RMs may face greater technical difficulties. We proposed a three-stage method for automatic entity extraction. First, we constructed prompt-based templates for a large language model (GPT-3.5) to perform zero-shot learning, which generated coarse-grained annotations of method entities. Second, a vocabulary containing standard method terms and their variations (e.g. abbreviations, synonyms) was built through manual collection and multiple rounds of expert evaluation. Third, inspired by Gupta and Manning's work [8], we designed an iterative learning process to identify and correct method entities in the resulting dataset in a human-in-the-loop manner, during which rule-based transformation and classification of RMs were performed. During RM conversion, each method entity was automatically “translated” into a regular RM in Chu and Ke's taxonomy [17] if a rule matched.

To observe and summarize the overall pattern of RMs in DH, we grouped them into three broader categories following Jarvelin and Vakkari [14], namely qualitative research, quantitative research, and mixed research. For instance, qualitative research includes “content analysis”, “ethnography and field study”, “historical method” and “interview”, while representative quantitative methods include “experiment”, “think aloud protocol”, “transaction log analysis” and “bibliometrics”. A study was judged to be “mixed” only when at least one RM of each type was used.

3.3 National preference of RMs

For each DH record, we used regular expressions to match all authors and their institutions. A simple program then mapped them to the relevant countries based on the organization names in the GRID database. To handle the many-to-many relationship between records and countries, we followed the counting rule in [22]: a country is counted as using a method once when the relevant RM is mentioned in a paper, regardless of how many of the paper's authors correspond to that country. The cumulative count of RM usage for a country was ultimately defined as its preference for RMs. Moreover, we used an agglomerative hierarchical clustering algorithm [23] to distinguish different country-level preference patterns.

4. Result Analysis

Among the three types of RM, the quantitative approach is the most mainstream, accounting for 82.29% of the records in the dataset. Mixed approaches (9.93%) turn out to be slightly more common than qualitative approaches in DH research. Considering the increasing growth of DH papers [21], quantitative analysis appears to be becoming an increasingly important research means.

Specifically, the dominant position of Western countries in DH studies is clear (see Table 1). The United States, which uses RMs most frequently, has the most significant lead over other countries. The United Kingdom and Germany are in second place; Germany in particular holds a clear leading position in the qualitative type. The third-ranked group comprises China, the Netherlands, Canada, Spain and Australia. It is worth noting that China, one of the few Asian countries on the list (alongside Singapore and Israel), performs impressively in mixed and quantitative research, possibly owing to its diversified use of RMs in DH.

Table 1
The ranking of the total number of RM types used by the top 5 countries. Note: US, UK, NL, GER and SGP stand for United States, United Kingdom, Netherlands, Germany and Singapore, respectively.
Rank    mixed        qualitative      quantitative
1       US (24)      US (29)          US (166)
2       UK (10)      GER (7)          GER (106)
3       China (9)    UK (7)           UK (86)
4       NL (7)       Australia (5)    China (48)

Figure 2: The statistics of national preference of RMs.

As an indicator of the quantitative approach, “experiment”-based approaches are the most frequently used (see Figure 2). This can be inferred from the visual analysis of RM usage, in which the relative rates of its usage reach 76.98%-94.94%. Even when the sample is expanded to countries with a usage frequency greater than 10, the rate still exceeds 60%. Thus, we temporarily exclude the RM “experiment”.
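
To make the counting rule of Section 3.3 and the relative usage rates quoted above more concrete, the following sketch maps extracted method mentions to RMs with a small dictionary, labels a paper as qualitative, quantitative or mixed, and counts each RM once per paper and country. The variant dictionary, the input layout and all function names are hypothetical; only the RM names and their qualitative/quantitative grouping follow the examples given in Section 3.2.

```python
from collections import Counter, defaultdict

# Hypothetical variant -> RM dictionary (the authors' vocabulary is not reproduced here).
VARIANT_TO_RM = {
    "semi-structured interviews": "interview",
    "interviews": "interview",
    "controlled experiment": "experiment",
    "citation analysis": "bibliometrics",
    "content analysis": "content analysis",
}

# RM -> type, using the examples listed in Section 3.2.
RM_TYPE = {
    "content analysis": "qualitative",
    "interview": "qualitative",
    "experiment": "quantitative",
    "bibliometrics": "quantitative",
}


def to_rms(mentions):
    """Map raw method mentions to RMs; mentions without a matching rule are ignored."""
    return {VARIANT_TO_RM[m.lower()] for m in mentions if m.lower() in VARIANT_TO_RM}


def paper_type(rms):
    """Return 'mixed' only if at least one qualitative and one quantitative RM occur."""
    types = {RM_TYPE[rm] for rm in rms}
    if not types:
        return None
    return "mixed" if len(types) == 2 else next(iter(types))


def country_rm_counts(papers):
    """Count each RM once per (paper, country), however many authors share the country."""
    counts = defaultdict(Counter)
    for paper in papers:
        rms = to_rms(paper["mentions"])
        for country in set(paper["author_countries"]):
            counts[country].update(rms)
    return counts


def relative_rate(counts, country, rm):
    """Share of one RM in a country's cumulative RM usage."""
    total = sum(counts[country].values())
    return counts[country][rm] / total if total else 0.0
```

Deduplicating `author_countries` with `set(...)` is what implements the "once per paper" rule: a paper with two authors from the same country still contributes a single count per RM to that country.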
According to Figure 3, theoretical approaches, as the most important qualitative methods, stand out from the remaining RMs. They are frequently used by most developed countries in Europe and America, such as the United States and the United Kingdom. Chinese DH scholars seem to show a keener interest in bibliometric methods, while American scholars prefer theoretical approaches. Both countries pay great attention to “interview”. By contrast, “observation”, “transaction log analysis”, “research diary or journal” and “focus group” are not highly valued by these countries. Only one or two RMs are used in other lower-ranked countries, notably Finland, which shows the most intensive preference for theoretical approaches. Relatively speaking, the choice of RMs is more evenly distributed for Canada, indicating that Canadian scholars may be more open to qualitative methods.

Figure 3: Preference choices of RMs measured by usage frequency (top) or usage ratio (bottom) in representative countries.

The clustering results for RM preference are shown in Figure 4. There are three clusters in the entire field of DH, namely #1 (United States), #2 (Germany, China, and United Kingdom), and #3 (other countries).

Figure 4: Different clusters of countries based on RM-related preference.

For the four RMs “Experiment”, “Theoretical approach”, “Others” and “Interviews” (i.e. ETOI approaches), the above machine learning-based grouping gives a clear, high-level picture of the different national preference levels for RMs. #1 is the group with a strong preference for ETOI approaches. #2 and #3 are the medium-level and weak-level preference groups, respectively. Furthermore, there are some distinguishing features among the three groups. Content analysis is heavily weighted in #1. #2 is more likely to use bibliometric analysis in DH scholarship. By contrast, #3 focuses on observational methods. These differences are related not only to the overall performance of each cluster but are also greatly influenced by individual members whose preferences are strongly polarized, such as China's preference (in #2) for bibliometric methods.

5. Conclusion

In this paper, we proposed an optimized iterative learning approach for RM extraction that combines large language models and rule-based transformation to extract and classify RMs from a constructed DH dataset. We compared the differences in RM preferences across countries, which revealed distinctive country-level preference patterns in DH. As a preliminary study, our findings can provide guidance for further improving DH development.

Acknowledgements

This project is supported by a grant from the National Natural Science Foundation of China (No. 72204258).


References

[1] Zhang, H., Zhang, C. (2021). Using Full-text Content of Academic Articles to Build a Methodology Taxonomy of Information Science in China. Knowledge Organization, 48(2), 126-139.
[2] Hepburn, B., Andersen, H. (2021). Scientific Method. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Summer 2021). Metaphysics Research Lab, Stanford University.
[3] Ding, Y., Song, M., Han, J., et al. (2013). Entitymetrics: Measuring the impact of entities. PloS One, 8(8), e71416.
[4] Poole, A. H. (2017). The conceptual ecology of digital humanities. Journal of Documentation, 73(1), 91-122.
[5] Jockers, M., Worthey, G. (2011). Introduction: welcome to the big tent. Proceeding of Digital Humanities 2011 Conference, 6-7.
[6] Zha, H., Chen, W., Li, K., Yan, X. (2019). Mining Algorithm Roadmap in Scientific Publications. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1083-1092.
[7] Wang, Y., Zhang, C. (2020). Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing. Journal of Informetrics, 14(4), 1-21.
[8] Gupta, S., Manning, C. D. (2011). Analyzing the dynamics of research by extracting key aspects of scientific papers. Proceedings of 5th International Joint Conference on Natural Language Processing, 1-9.
[9] Zhang, H., Ma, S., Zhang, C. (2019). Using Full-text of Academic Articles to Find Software Clusters. Proceedings of ISSI, 2776-2777.
[10] Wang, J., Cheng, Q., Lu, W., et al. (2023). A term function–aware keyword citation network method for science mapping analysis. Information Processing & Management, 60(4), 103405.
[11] Zhang, H., Zhang, C., Wang, Y. (2024). Revealing the technology development of natural language processing: A Scientific entity-centric perspective. Information Processing & Management, 61(1), 103574.
[12] Zhang, C., Tian, L., Chu, H. (2023). Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021. Information Processing & Management, 60(6), 103507.
[13] Boland, K., Krüger, F. (2019). Distant supervision for silver label generation of software mentions in social scientific publications. Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries, 15-27.
[14] Jarvelin, K., Vakkari, P. (1990). Content analysis of research articles in library and information science. Library & Information Science Research, 12, 395-421.
[15] Hider, P., Pymm, B. (2008). Empirical research methods reported in high-profile LIS journal literature. Library & Information Science Research, 30, 108-114.
[16] Kumpulainen, K. (1991). Library and information science research in 1975. Libri, 41(1), 59-76.
[17] Chu, H., Ke, Q. (2017). Research methods: What’s in the name?. Library & Information Science Research, 39(4), 284-294.
[18] Lou, W., Su, Z., He, J., et al. (2021). A temporally dynamic examination of research method usage in the Chinese library and information science community. Information Processing & Management, 58(5), 102686.
[19] Lund, B. D., Wang, T. (2021). An analysis of research methods utilized in five top, practitioner-oriented LIS journals from 1980 to 2019. Journal of Documentation, 77(5), 1196-1208.
[20] Tang, M. C., Cheng, Y. J., Chen, K. H. (2017). A longitudinal study of intellectual cohesion in digital humanities using bibliometric analyses. Scientometrics, 113(2), 985-1008.
[21] Su, F., Zhang, Y. (2021). Research output, intellectual structures and contributors of digital humanities research: a longitudinal analysis 2005–2020. Journal of Documentation, 78(3), 673-695.
[22] Sidone, O. J. G., Haddad, E. A., Mena-Chalco, J. P. (2017). Scholarly publication and collaboration in Brazil: The role of geography. Journal of the Association for Information Science and Technology, 68(1), 243-258.
[23] Sasirekha, K., Baby, P. (2013). Agglomerative hierarchical clustering algorithm-a. International Journal of Scientific and Research Publications, 83(3), 83.
