<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Characterizing Emerging Technologies of Global Digital Humanities Using Scientific Method Entities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shaojian Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengxi Yan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Humanities Research Center, Renmin University of China</institution>
          ,
          <addr-line>Beijing, China, 100872</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Information Resource Management, Renmin University of China</institution>
          ,
          <addr-line>Beijing, China, 100872</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Emerging technologies support the evolution of disciplines. Scientific method entities, as proxies for emerging technologies, provide a framework for the development of emerging technologies. Identifying and extracting scientific method entities is therefore an important step in the study of emerging technologies. The field of digital humanities is inherently interdisciplinary, combining traditional humanities disciplines with digital tools and technologies. It is thus particularly important for scholars in digital humanities to stay up to date with emerging technologies, which have the potential to transform the way we approach research and scholarship. However, extracting and evaluating emerging technologies remains problematic, particularly in the field of digital humanities. To address these issues, this paper proposes an AI-based method to automatically extract scientific method entities and analyzes in depth the state of emerging technologies in the field of digital humanities.</p>
      </abstract>
      <kwd-group>
        <kwd>Emerging technologies</kwd>
        <kwd>Scientific method entity</kwd>
        <kwd>Digital humanities</kwd>
        <kwd>Entity extraction and evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Emerging technologies (ETs) are
technical innovations that represent progressive
developments within a field for competitive
advantage [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; they are science-based innovations
that have the potential to create a new industry or
transform an existing one [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. On this basis, we
regard ETs as the dominant innovative
technologies or methods that emerge in a specific
field during a specific period. Studying ETs
therefore offers a quick way to understand the
development of a research field, especially an emerging
interdisciplinary field such as digital humanities
(DH). The scientific method entity (SME) is an
extensively researched object in various research
fields. As good proxies for ETs, SMEs provide an
objective and systematic approach to
characterizing the ETs of global DH.
      </p>
      <p>However, existing research has two main issues.
First, traditional methods of extracting
SMEs mostly rely on topic models and manual
methods [3]. Topic terms are usually
general: they may lack a specific meaning and
are difficult to interpret. Manual
methods are labor-intensive and are not
sustainable in an age of information explosion.
Second, previous studies
of DH focus only on the landscape of
knowledge topics and structures, while the
detailed features of ETs (e.g., knowledge
distribution and temporal evolution) remain
unknown. To address these issues, we use specific SMEs
extracted from DH documents, instead of topics, to represent ETs,
and we analyze the ETs in depth
based on their bibliometric relationships. The
contribution of our research is two-fold: the first is
a newly designed approach, based on an
AI-enhanced algorithm, for automatically extracting
SMEs in the DH domain; the second is a feature
analysis of the knowledge patterns related to ETs in
DH, in both static and dynamic ways.</p>
      <p>We therefore explore two research questions:
RQ1: How can ETs be identified and extracted?</p>
      <p>RQ2: How are ETs distributed in the domain of DH,
and how have they evolved?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Work on the extraction and
evaluation of knowledge entities has received
widespread attention [4]. Four methods are
frequently used in method entity
extraction: manual annotation, rule-based
extraction, traditional machine learning, and
deep learning [
        <xref ref-type="bibr" rid="ref3">5</xref>
        ]. Each method has its own pros
and cons. Manual annotation is precise but
inefficient. Rule-based extraction has a high
data-processing capacity but lacks flexibility. Traditional
machine learning is more flexible but relies on
feature engineering. Deep learning is
highly versatile but requires a training corpus.
      </p>
      <p>
        To reveal the intellectual structure of DH,
previous studies focused on the task of
topic extraction, for which bibliometric analysis
approaches are frequently used. For example,
Tang et al. used the TF-IDF algorithm to identify
research topics with higher discriminative value
based on author-assigned keywords [
        <xref ref-type="bibr" rid="ref4">6</xref>
        ]. In Wang's
research [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ], the keyword co-occurrence
network presents a macroscopic view of the
distribution of hot topics in DH, where each topic
is regarded as a group of interrelated descriptors.
Similarly, Su et al. expanded
the sources of topic candidates beyond
keywords to representative
terms from titles and abstracts, making the results of domain topic
analysis more comprehensive. However, the
recognized topic terms turn out
to be a collection of general hot concepts for DH
[
        <xref ref-type="bibr" rid="ref6">8</xref>
        ]. The specific pattern of ETs in the DH field
remains abstract and unclear, especially because the
extracted high-frequency words, such as "digital humanities",
"cooperation", "communication" and "data", do not carry
detailed meaning.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Main Framework</title>
      <p>Based on the above analysis, we propose a
solution that consists of two parts: ET extraction
and ET analysis. The ET
extraction part extracts SME
candidate words with an AI-based algorithm,
then applies lemmatization, stemming
and filtering to them, and finally obtains ET units.
The ET analysis part maps the ET units to
the DH collection to obtain the SME dictionary,
and then examines the distribution of SMEs,
clustering based on ET co-occurrence, and
evolution based on word frequency.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. AI-based Extraction of SMEs</title>
      <p>To overcome the shortcomings of
feature-engineering-based topic extraction and rule-based
recognition of scientific method knowledge, we
propose an AI-empowered semi-automatic
extraction method (ASAEM). This
approach is essentially a two-stage pipeline.
In the first stage, instead of manual
judgement, a state-of-the-art large language
model, ChatGPT (https://openai.com/blog/chatgpt), is used to process documents
and derive method-related entity candidates,
which are the fundamental units for building a
dictionary of method entities. Since the key to
ASAEM is determining the most suitable prompt,
the algorithm includes an efficient mechanism for
automatically detecting the optimal template. In
the second stage, we use this dictionary
to match and extract all the method entities from
DH collections while excluding a small number
of general high-frequency terms, which helps
to significantly improve the recall of the algorithm.
Figure 2 presents the pseudo-code of ASAEM.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Detection and Clustering of ETs in DH</title>
      <p>We extracted a total of 24,306 SME candidates
from DH collections and screened them with the
rule that the word length is less than 40 and the
word frequency is greater than 3, which left
846 SMEs. We then used the WordNetLemmatizer
in NLTK to lemmatize
these SMEs, so that different morphological forms
of the same word (arising from number, tense,
voice, etc.) are merged.</p>
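      <p>The screening rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name screen_candidates, the lowercasing step, and the reading of "word length" as a character count are all assumptions, and the input list is toy data.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def screen_candidates(candidates, max_len=40, min_freq=3):
    """Keep candidates shorter than max_len characters that occur more than min_freq times.

    Assumption: "word length" means the number of characters, and
    candidates are compared case-insensitively.
    """
    counts = Counter(c.strip().lower() for c in candidates if c.strip())
    return {
        term: n
        for term, n in counts.items()
        if len(term) < max_len and n > min_freq
    }


# Toy input standing in for the extracted candidate mentions (hypothetical data).
mentions = ["text mining"] * 5 + ["GIS"] * 4 + ["topic model"] * 3 + ["x" * 45] * 6
kept = screen_candidates(mentions)
print(kept)
```
]]></preformat>
      <p>In this toy run, "topic model" is dropped because its frequency is not greater than 3, and the 45-character string is dropped because of its length.</p>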
      <p>We then used NLTK to stem the
lemmatized words for better mapping to the
original text. In the next step, we merged some
words using pattern-matching rules (for
abbreviations, synonyms, etc.) and filtered out
meaningless words, which yielded the ET units.
We mapped these units to the content of each DH
collection to obtain the SME dictionary, and
from it the distribution of ETs in DH
from 1903 to 2022. On the SMEs, we mainly performed ET
co-occurrence clustering analysis and a word-frequency-based
analysis of evolution over time.
We take the relative strength index, the Equivalence
coefficient, as the weight of edges and word
frequency as the weight of nodes, import them into Gephi,
and apply modularity clustering to obtain the clustering
results.</p>
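      <p>As a rough illustration of mapping ET units onto documents, the sketch below matches a small dictionary against document text and counts, per year, the number of documents in which each entity appears. It is a simplified stand-in under stated assumptions: the names match_entities and yearly_distribution, the variant lists, and the sample documents are hypothetical, and explicit variant lists are used here in place of the stemming and pattern-matching rules described in the text.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter, defaultdict


def match_entities(text, dictionary):
    """Return the canonical entities whose variants occur in text as whole phrases."""
    lowered = text.lower()
    found = set()
    for canonical, variants in dictionary.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant.lower()) + r"\b"
            if re.search(pattern, lowered):
                found.add(canonical)
                break
    return found


def yearly_distribution(docs, dictionary):
    """Count, for each year, the number of documents mentioning each entity."""
    by_year = defaultdict(Counter)
    for year, text in docs:
        by_year[year].update(match_entities(text, dictionary))
    return by_year


# Hypothetical dictionary: canonical entity mapped to surface variants.
sme_dictionary = {
    "geographic information systems": [
        "geographic information systems",
        "geographic information system",
        "GIS",
    ],
    "machine learning": ["machine learning"],
}

# Hypothetical (year, text) records.
docs = [
    (2015, "We build a GIS of historical maps using machine learning."),
    (2015, "A geographic information system for archival records."),
    (2021, "Machine learning for manuscript dating."),
]

dist = yearly_distribution(docs, sme_dictionary)
for year in sorted(dist):
    print(year, dict(dist[year]))
```
]]></preformat>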
    </sec>
    <sec id="sec-7">
      <title>4. Empirical Studies</title>
    </sec>
    <sec id="sec-10">
      <title>4.1. Dataset Details and Implementation</title>
      <p>Considering the interdisciplinary nature
of DH, we first conducted a preliminary
exploration of data sources for DH documents.
Our goal was to obtain as many relevant documents
as possible, which requires not only
the largest quantity but also a wide range of document
types. After comparing the number of papers
retrieved from three well-known databases (i.e., Web
of Science, Crossref and
Dimensions), we selected the Dimensions database
to collect data. The query (digital
humanit* OR humanit* comput* OR ehumanit*
OR e-humanit*) was used to search the
title, keyword and abstract fields, which yielded 4398
documents. After removing duplicate and
unrelated documents, we obtained 3469 documents
as the initial "target set". Descriptive
statistics for the set are shown in Figure 3.</p>
      <p>On the one hand, compared with
the number of documents, the number of
DH-related terms, which shows an upward trend before
2020, is at a much larger order of
magnitude, which implies that direct term-based
extraction of ETs may introduce significant errors. On
the other hand, the document types in our
chosen dataset are relatively rich and cover
almost all kinds of records in the DH field.</p>
    </sec>
    <sec id="sec-11">
      <title>4.2. Result Analysis</title>
    </sec>
    <sec id="sec-12">
      <title>4.2.1. Distribution and Clustering of ETs</title>
      <p>
        Many systems of scientific interest can be
represented as networks, sets of nodes or vertices
joined in pairs by lines or edges. Many networks
of interest in the sciences are found to divide
naturally into communities or modules. The
problem of detecting and characterizing this
community structure is one of the outstanding
issues in the study of networked systems [
        <xref ref-type="bibr" rid="ref7">9</xref>
        ]. We
performed a co-ET (ET co-occurrence) cluster analysis on the ET
distribution results. If the word pair (W<sub>i</sub>, W<sub>j</sub>) does
not co-occur in the document collection, the direct
correlation strength E<sub>i,j</sub> is counted as 0. If there is
a co-occurrence relationship, this paper uses the
relative strength index, the Equivalence coefficient
[
        <xref ref-type="bibr" rid="ref8">10</xref>
        ], to evaluate the word pair frequency. Through this
processing, the multi-valued matrix is converted
into a correlation matrix whose element values
lie in the interval [0, 1], as shown in Formula 1:
        <disp-formula><label>(1)</label><tex-math>E_{i,j} = \frac{G_{i,j}^{2}}{G_{i}\, G_{j}}</tex-math></disp-formula>
Here, G<sub>i,j</sub> represents the number of times
keywords W<sub>i</sub> and W<sub>j</sub> co-occur in the same
document, and G<sub>i</sub> and G<sub>j</sub> represent the total frequencies
of keywords W<sub>i</sub> and W<sub>j</sub>, respectively. The
calculation results are used for modularity clustering
of method entity distributions in the DH literature,
acting as the weights of the edges in the
ET network.
      </p>
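      <p>The edge weights can be computed along the following lines. This sketch assumes the usual definition of the Equivalence coefficient, the squared co-occurrence count divided by the product of the two individual frequencies, and it further assumes that "total frequency" means the number of documents in which an entity appears. The function name equivalence_weights and the sample documents are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations


def equivalence_weights(doc_entities):
    """Return {(w_i, w_j): G_ij**2 / (G_i * G_j)} for every co-occurring pair.

    Assumption: G_i is the number of documents containing w_i, and G_ij is
    the number of documents containing both w_i and w_j. Pairs that never
    co-occur are simply absent, which corresponds to a strength of 0.
    """
    freq = Counter()
    pair_freq = Counter()
    for entities in doc_entities:
        unique = sorted(set(entities))
        freq.update(unique)
        pair_freq.update(combinations(unique, 2))
    return {
        pair: g_ij ** 2 / (freq[pair[0]] * freq[pair[1]])
        for pair, g_ij in pair_freq.items()
    }


# Hypothetical per-document entity sets.
docs = [
    {"gis", "text mining"},
    {"gis", "text mining"},
    {"gis", "machine learning"},
    {"text mining"},
]

weights = equivalence_weights(docs)
for pair, weight in sorted(weights.items()):
    print(pair, round(weight, 3))
```
]]></preformat>
      <p>Each key is an alphabetically ordered pair, so the resulting dictionary can be written out as an undirected weighted edge list for a network tool such as Gephi.</p>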
      <p>From the clustering results of ETs, we can see
that DH is a typical interdisciplinary field that
contains very diverse ETs, covering all aspects of
data processing (i.e., data collection, data
extraction, data coding, data classification, data
analysis, data mining and data visualization).
In particular, some of the core SMEs in the ET
clusters, such as data presentation, data
classification and pattern recognition, are clearly
the preferred methods of the global DH
community. Moreover, some DH research
methods come from other disciplines, such as
distant reading and near reading from the field of
literature, gender studies and feminism, and
archival studies, which is strong evidence
that DH is an interdisciplinary field. In
addition, there are applications of ETs
from artificial intelligence, such as machine learning,
which reflects the substantial role of AI in promoting
the development of the DH domain.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2.2. Top 10 SMEs in different periods</title>
      <p>As shown in Figure 5, most of the top 10 SMEs
are the same in the three stages: geographic
information systems, humanities
research, artificial intelligence, history studies,
retrieval, textual analysis, technological
innovation, etc. are in the top 10 in all three
stages. This shows that some methods have been
used throughout the development of DH and have
not changed. These are the most widely used
and mature methods in DH, and they indicate that
the ETs of DH rest on some basic methods. In addition,
some method entities are tied to a
particular phase. For example, social science
appears in the top 10 only in stages 1 and 2 and is no longer
in the top 10 in stage 3, while information science
and cultural analysis are not prominent in stages 1
and 2 but enter the top 10 in stage 3, which
shows the transformation of DH's ETs. Moreover,
geographic information systems hold an almost
dominant position in every period, which shows
that DH has always attached importance to
integration with geographic methods.</p>
    </sec>
    <sec id="sec-14">
      <title>4.2.3. The evolution of Top 10 SMEs</title>
      <p>From Figure 6, we can see that the top 10 SMEs
in DH roughly follow a general trend of
first rising and then falling. From a macro
perspective, the evolution of SMEs in the DH
field can be roughly divided into three stages.
The first is the initial stage, which spans from 1903 to
2000. SMEs at this stage show
almost no growth, mainly due to the lack of
papers at this time. The second is the vigorous
expansion stage, from 2000 to 2020. During this
period, many events promoted the
development of DH, such as the release of the
Digital Humanities Manifesto, the inauguration of
Digital Humanities
Quarterly, the Global Digital Humanities Annual
Conference, and the development of digital
humanities education, all of which attest to its rapid
development. Starting in 2020, however, all of the top 10 SMEs
show a downward trend, which may be due to
the impact of the COVID-19 epidemic and the
international situation. From a micro
perspective, geographic information systems (GIS),
digital analysis, and humanities research have
always had a high frequency. GIS in particular has
always maintained the highest frequency,
reflecting the importance that DH scholars attach
to this method. More interestingly, AI
is also valued in digital humanities, which shows
that AI technology is indeed an important aid to
DH research. Although top-ten SMEs such as AI
are quantitative analysis methods,
traditional humanities research methods that are
mainly qualitative, such as
history studies and cultural analysis, are also a
main focus.</p>
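      <p>A per-stage ranking of this kind could be produced as sketched below. This is only an illustration: the text gives the stages as 1903 to 2000, 2000 to 2020, and 2020 onward, so the assignment of the boundary years 2000 and 2020 here is an assumption, as are the function names and the sample counts.</p>
      <preformat><![CDATA[
```python
from collections import Counter, defaultdict

# Assumed inclusive year ranges; how the boundary years 2000 and 2020 are
# assigned is not stated in the text and is chosen here for illustration.
STAGES = [
    ("stage 1", 1903, 1999),
    ("stage 2", 2000, 2019),
    ("stage 3", 2020, 2022),
]


def stage_of(year):
    """Return the name of the stage containing year, or None if out of range."""
    for name, start, end in STAGES:
        if start <= year <= end:
            return name
    return None


def top_k_by_stage(records, k=10):
    """Aggregate (year, entity, count) records and rank entities within each stage."""
    totals = defaultdict(Counter)
    for year, entity, count in records:
        stage = stage_of(year)
        if stage is not None:
            totals[stage][entity] += count
    return {stage: counter.most_common(k) for stage, counter in totals.items()}


# Hypothetical yearly frequencies.
records = [
    (1995, "social science", 4),
    (1998, "gis", 6),
    (2010, "gis", 30),
    (2012, "social science", 12),
    (2015, "artificial intelligence", 9),
    (2021, "gis", 20),
    (2021, "information science", 11),
]

top = top_k_by_stage(records, k=2)
for stage in sorted(top):
    print(stage, top[stage])
```
]]></preformat>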
    </sec>
    <sec id="sec-15">
      <title>5. Discussion and Conclusion</title>
      <p>DH's ETs demonstrate that DH is an
interdisciplinary field of study involving multiple
disciplines. ETs in the field of DH are mainly
related to data processing, such as data collection
and extraction, data mining and analysis, and data
visualization and presentation. Many
ETs also involve the intersection with the humanities and
social sciences, such as history, literature,
sociology and art. The interdisciplinary character
of its ETs underpins the
vitality of DH. DH's ETs are also
persistent. Whether looking at
the top 10 SMEs in different periods or
at the overall top 10 SMEs, we find a high
degree of convergence: some methods have been
used all the time and have not changed.
At the same time, the ETs of DH also exhibit
a partial shift. For example, social science was
more important in the early stage,
and the emphasis turned to information science
and cultural analysis in the later stage. DH's ETs
have always combined the digital and the
humanities, with no bias towards one side
that would lead to imbalance.</p>
      <p>Observing the evolution trends, we find that
the evolution paths of digital technologies and
humanities are completely synchronized, which
shows that the humanities have always been valued in
the field of DH; DH does not, as some
scholars say, emphasize technology over the
humanities. In other words, there is no need for
DH to "return to the humanities" in the future,
because the humanities have always been a focus. This is
strong evidence of the healthy development of the
discipline. In addition, a large majority of ETs are
related to artificial intelligence, which
embodies the word "digital" in DH,
while many ETs involve the intersection with the
humanities and social sciences, such as history,
literature, sociology and art, which
embodies the word "humanities". Finally,
the application of AI has greatly
promoted the development of DH.</p>
    </sec>
    <sec id="sec-16">
      <title>6. Acknowledgements</title>
      <p>This work is supported by the Fundamental
Research Funds for the Central Universities and
the Research Funds of Renmin University of
China (Grant No. 23XNH150).</p>
    </sec>
    <sec id="sec-17">
      <title>7. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <source>Innovation and technology-strategies and policies</source>
          [M].
          <publisher-name>Springer Science &amp; Business Media</publisher-name>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Day</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>P.J.H.</given-names>
            <surname>Schoemaker</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>A different game</article-title>
          .
          In
          <source>Wharton on managing emerging technologies</source>
          , ed. G.S. Day and P.J.H. Schoemaker,
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          . New York: John Wiley.
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          [3-4]
          <string-name>
            <surname>Wang</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>C</given-names>
          </string-name>
          .
          <article-title>Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing</article-title>
          [J].
          <source>Journal of informetrics</source>
          ,
          <year>2020</year>
          ,
          <volume>14</volume>
          (
          <issue>4</issue>
          ):
          <fpage>101091</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>A review on method entities in the academic literature: extraction, evaluation, and application</article-title>
          .
          <source>Scientometrics</source>
          <volume>127</volume>
          ,
          <fpage>2479</fpage>
          -
          <lpage>2520</lpage>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Y. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K. H.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>A longitudinal study of intellectual cohesion in digital humanities using bibliometric analyses</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>113</volume>
          (
          <issue>2</issue>
          ),
          <fpage>985</fpage>
          -
          <lpage>1008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Distribution features and intellectual structures of digital humanities: A bibliometric analysis</article-title>
          .
          <source>Journal of Documentation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Immel</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Digital humanities research: interdisciplinary collaborations, themes and implications to library and information science</article-title>
          .
          <source>Journal of Documentation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Newman</surname>
            <given-names>M E J</given-names>
          </string-name>
          .
          <article-title>Modularity and community structure in networks</article-title>
          [J].
          <source>Proceedings of the National Academy of Sciences</source>
          ,
          <year>2006</year>
          ,
          <volume>103</volume>
          (
          <issue>23</issue>
          ):
          <fpage>8577</fpage>
          -
          <lpage>8582</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Callon</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courtial</surname>
            <given-names>J P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laville</surname>
            <given-names>F</given-names>
          </string-name>
          .
          <article-title>Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemistry</article-title>
          [J].
          <source>Scientometrics</source>
          ,
          <year>1991</year>
          ,
          <volume>22</volume>
          :
          <fpage>155</fpage>
          -
          <lpage>205</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>