=Paper= {{Paper |id=Vol-2658/paper4 |storemode=property |title=Investigating Interdisciplinary Knowledge Flow from the Content Perspective of Citances |pdfUrl=https://ceur-ws.org/Vol-2658/paper4.pdf |volume=Vol-2658 |authors=Jin Mao,Shiyun Wang,Xianli Shang |dblpUrl=https://dblp.org/rec/conf/jcdl/MaoWS20 }} ==Investigating Interdisciplinary Knowledge Flow from the Content Perspective of Citances== https://ceur-ws.org/Vol-2658/paper4.pdf
                            EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents




       Investigating interdisciplinary knowledge flow from the content
                            perspective of citances
                       Jin Mao                                     Shiyun Wang                                     Xianli Shang†
     School of Information Management                   School of Information Management                       Business School
             Wuhan University                                   Wuhan University                     Xinyang Agriculture and Forestry Univ
           Wuhan, Hubei, China                                Wuhan, Hubei, China                           Xinyang, Henan, China
            maojin@whu.edu.cn                                 wangsy2@whu.edu.cn                              47655282@qq.com



† Corresponding author.

ABSTRACT
Interdisciplinary research is playing an important role in modern science. In recent years, many studies have measured interdisciplinary knowledge flow based on the frequency of citations. However, this approach does not consider the content of knowledge carried in the citations. In this study, we attempt to investigate the content of knowledge flow towards an interdisciplinary field by analyzing the citation sentences (i.e., citances) in the articles of the field. An emerging field, eHealth, is chosen in the case study. The associated knowledge phrases between citances and the references of the field are identified and categorized to analyze the content and categories of knowledge spread from the source disciplines to the field. The result shows that the ranks of disciplines by the frequency of associated phrases are consistent with the ranks by the frequency of in-text citations. Distribution of associated phrases over categories and disciplines is also analyzed. Phrases in the research subject category are the most frequent, followed by those in the entity category. This study contributes to the understanding of content characteristics of interdisciplinary knowledge integration.

CCS CONCEPTS
• Theory of computation ~ Semantics and reasoning ~ Program semantics ~ Categorical semantics; • Information systems ~ Information systems applications ~ Data mining; • Information systems ~ Information retrieval ~ Retrieval tasks and goals ~ Information extraction

KEYWORDS
Interdisciplinary research, Content classification, eHealth, In-text reference, Knowledge integration

1 Introduction
Interdisciplinary research has become an important research paradigm and many recent significant breakthroughs in science are the fruits of interdisciplinary research. One fundamental feature of interdisciplinary research is the integration of knowledge from multiple disciplines outside the field [1]. Methods, theories, tools and concepts from different disciplines are often integrated to solve complex research problems of interdisciplinary research. To understand the characteristics of interdisciplinary knowledge integration, citation analysis has often been used to examine knowledge flow among disciplines[2]. Conventionally, the knowledge flow to a field is simply measured by the number of references cited by the papers in the field. The differing importance, motivations and many other aspects of citations in a paper are ignored.

Recent studies have shifted to investigating interdisciplinary knowledge flow from a finer-grained perspective by looking into the content and contexts of citations. Citation contexts, which embed the syntactic (e.g., the location of section and rhetoric style) and semantic (e.g., the meaning of citation content) information of citations, have become easier to obtain in recent years[3]. Citation contexts have been used to differentiate the functions[4-5], importance[6] and knowledge contributions[7] of different citations. The rich information of citation contexts enables analysis of what knowledge is integrated into an interdisciplinary field.

In this study, we attempt to explore the content of knowledge integrated into an interdisciplinary field, eHealth, by analyzing the citances. The field of eHealth is an emerging field, referring to all aspects of the intersection of health care and the Internet[8]. A citance, which provides the context of a citation, is defined as a sentence that contains in-text reference information. Our research questions are what knowledge is integrated from the source disciplines into eHealth, and what types this knowledge falls into. In this study, we design an approach to analyze the content and categories of the knowledge shared between citances and the references. This study contributes to understanding the content characteristics of interdisciplinary knowledge integration.

2 Methodology

2.1 Data Collection
Two high-impact eHealth journals, Journal of Medical Internet Research and JMIR mHealth and uHealth, were selected as our data sources. All 3,416 articles with XML files published from 1999 to 2018 were collected. We only focused on the 3,221 articles with the types of Original papers, Reviews and Viewpoints. The metadata of references, including title, citation type, journal name, DOI, PubMed ID, and publication year, were




          Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

parsed from the XML files. Sentences were extracted by using the               an associated knowledge phrase is defined as a noun phrase
punctuations (periods, question marks, etc.) as sentence                       appearing in both a citance and its reference, which could be
boundaries, then citances with in-text references were identified.             regarded as the knowledge transferred from the reference to the
In total, 115,456 citances and 140,572 reference records were                  citing paper.
obtained.
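
The definition of an associated knowledge phrase (a noun phrase appearing in both a citance and its reference) can be illustrated as a set intersection. The following is a minimal sketch, assuming the noun phrases have already been extracted and the acronyms already detected by other tools; the function names, the lowercase normalization and the toy inputs are hypothetical and are not taken from the paper.

```python
def normalize(phrase):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(phrase.lower().split())


def associated_phrases(citance_phrases, reference_phrases, acronyms=None):
    """Return the citance noun phrases that also occur in the reference.

    `acronyms` maps a short form to its full form (hypothetical input).
    Both forms are used for matching, but only the raw citance phrase is kept.
    """
    short_to_full = {normalize(k): normalize(v) for k, v in (acronyms or {}).items()}
    full_to_short = {v: k for k, v in short_to_full.items()}

    def variants(phrase):
        form = normalize(phrase)
        forms = {form}
        if form in short_to_full:
            forms.add(short_to_full[form])
        if form in full_to_short:
            forms.add(full_to_short[form])
        return forms

    # Every surface form under which a reference phrase may be matched.
    reference_forms = set()
    for phrase in reference_phrases:
        reference_forms |= variants(phrase)

    matched = []
    for phrase in citance_phrases:
        if variants(phrase) & reference_forms and phrase not in matched:
            matched.append(phrase)
    return matched


citance = ["TAM", "mobile phone", "older adults"]
reference = ["technology acceptance model", "Mobile phone", "survey"]
print(associated_phrases(citance, reference, {"TAM": "technology acceptance model"}))
# prints ['TAM', 'mobile phone']
```

Keeping the raw citance form in the output mirrors the choice described in Section 2.3, where acronyms and full forms are both matched but only the citance wording is retained.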
To complete the abstracts of references, the reference records                 To analyze the types of the knowledge that flows to the eHealth
were fetched by searching PubMed for PubMed ID or Web of                       field, we designed a classification framework of associated
Science (WoS) for DOI. In total, the abstracts of 89,649 reference             knowledge phrases based on the previous studies [11-13]. Two
records were collected.                                                        graduate students familiar with the field of eHealth were recruited
                                                                               to annotate the categories of the associated knowledge phrases by
2.2    Source Discipline Identification                                        following the steps:
To explore the source of input knowledge, the references were
then categorized into the 22 disciplines of Essential Science
                                                                               1.     Initializing knowledge classification framework. One author
Indicators (ESI). We used the 2018 version of ESI journal list that
                                                                                      constructed a preliminary classification schema after
covers 11,727 journals with full titles, abbreviated titles and the
                                                                                      reviewing the literature. Then the author randomly selected
disciplines they belong to.
                                                                                      100 knowledge phrases for trial annotation, organized the
We designed a pipeline to determine the ESI disciplines of the
                                                                                      annotation details, and wrote an annotation specification
references. First, 7,393 distinct journal titles were obtained from
                                                                                      document that provides a detailed definition of each category
the 104,888 reference records with the citation type of ‘journal’
                                                                                      with a few exemplar concepts.
and with DOI/PubMed ID. We manually completed the full titles
                                                                               2.     Pre-annotation. Pre-annotation training was carried out for
for the abbreviated journal titles that could not be found in the ESI
                                                                                      the two coders. Subsequently, two coders independently
journal list but had more than 2 references. Next, we identified
                                                                                      annotated 500 identical knowledge phrases randomly
the disciplines of references by matching their journal titles with
                                                                                      selected for pre-annotation. After labeling, we calculated the
the journal titles in ESI. However, there were still 8,393 reference
                                                                                      kappa statistics to assess the agreement of the two coders.
records without the ESI discipline information. Since the coverage
                                                                                      The kappa was equal to 0.65, which was not as good as
of journals in ESI is not as broad as in WoS journal list, the WoS
                                                                                      expected. Thus, two coders discussed the ambiguous cases
subject categories were then used to infer the ESI disciplines of
                                                                                      with a professional in the eHealth field. We found that some
the journal titles that were not matched directly. We designed a
                                                                                      phrases may not make sense in isolation but are meaningful
method to map the WoS subject categories into the ESI
                                                                                      in the given context; as a result, many phrases were
disciplines. We calculated the likelihood of a WoS subject
                                                                                      assigned to the research subject category by one coder
category belonging to an ESI discipline through its journals whose
                                                                                      and to the others category by the other. After the
ESI disciplines are known. The ESI discipline with the highest
                                                                                      discussion, two coders reached a consensus.
probability was then determined as the ESI discipline of the WoS
category. If a journal has multiple WoS subject categories, we
also chose the ESI discipline that has the highest probability with            TABLE 1. The classification framework of associated
all the WoS categories.                                                        knowledge phrases.
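
The mapping from WoS subject categories to ESI disciplines can be sketched as follows. This is a minimal sketch under stated assumptions: the likelihood is estimated as the share of a category's journals (with known ESI discipline) that belong to each discipline, and a journal with several categories takes the discipline with the single highest likelihood across them. The function names and toy journals are hypothetical.

```python
from collections import Counter, defaultdict


def category_likelihoods(known_journals):
    """Estimate P(ESI discipline | WoS category) from journals with known ESI.

    `known_journals` is an iterable of (wos_categories, esi_discipline) pairs.
    """
    counts = defaultdict(Counter)
    for wos_categories, esi in known_journals:
        for category in wos_categories:
            counts[category][esi] += 1
    return {
        category: {esi: n / sum(c.values()) for esi, n in c.items()}
        for category, c in counts.items()
    }


def infer_esi(wos_categories, likelihoods):
    """Pick the ESI discipline with the highest likelihood over all categories."""
    best = None  # (probability, discipline)
    for category in wos_categories:
        for esi, p in likelihoods.get(category, {}).items():
            if best is None or p > best[0]:
                best = (p, esi)
    return best[1] if best else None


# Toy journals with known ESI disciplines (illustrative only).
known = [
    (["Medical Informatics"], "Clinical Medicine"),
    (["Medical Informatics", "Computer Science, Information Systems"], "Computer Science"),
    (["Medical Informatics"], "Clinical Medicine"),
    (["Computer Science, Information Systems"], "Computer Science"),
]
likelihoods = category_likelihoods(known)
print(infer_esi(["Medical Informatics"], likelihoods))
print(infer_esi(["Medical Informatics", "Computer Science, Information Systems"], likelihoods))
```

Other readings of the text are possible, for example summing the likelihoods over all categories of a journal before taking the maximum.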
Finally, approximately 94.09% of journal reference records
(98,685) were assigned discipline information.                                 Category          Description         Exemplar phrases
                                                                               Research          Subject  terms      depression,        diabetes,
2.3    Extracting and Classifying                       Associated             subject           related      to     health information
      Knowledge Phrases                                                                          research
Citation contexts contain information about the cited articles                                   problem
relevant to the citing papers[9-10]. We contend that the words                 Theory            Theory related      TAM, social cognitive
occurring in both the citation context and the corresponding cited                               phrases             theory,    transtheoretical
                                                                                                                     model
paper can reflect the explicit knowledge association between the
                                                                               Research          Methodology         systematic         review,
two to a certain extent. In this study, we used the title and abstract
                                                                               methodology       used in research    analysis, meta analysis,
to represent a cited paper (i.e., a reference) due to the difficulty of                                              randomize control trial
obtaining full text. We extracted noun phrases that carry                      Technology        Technique,          mobile     phone,      web,
meaningful concepts from the citances as well as the titles and                                  device       and    smartphone, app
abstracts of the references by using the package of spaCy, an                                    system used
open-source Python natural language processing toolkit. Noun                                     in research
phrases with a single character or some wildcards (e.g., “#”, “*”,             Entity            Human-related       patient, woman, child,
“@”, etc.) were removed. So were those starting or ending with a                                 research object     adolescent
number. Stop words listed in the NLTK package were also                        Data              Phrases related     twitter, qualitative datum,
                                                                                                 to dataset, data    clinical datum
eliminated. Acronyms were identified and expanded into their full
                                                                                                 source and data
forms by using the scispaCy package. We used both the acronyms
                                                                                                 material
and their full forms in the matching process, but only retained the            Others            Other     phrases   study, use, result, outcome,
raw forms of the noun phrases extracted from the citances. Thus,                                 that cannot be      number, Canada, project,








                  included in the    USA                                    associated with the top 10 disciplines (98.29% of all). Table 4
                  above categories                                          presents the frequency of associated knowledge phrases by
                                                                            discipline. It should be noted that only references with abstracts
                                                                            were used to extract associated knowledge phrases; therefore, the
3.    Formal annotation. The two coders annotated all 24,132
                                                                            numbers of in-text citations in Table 4 are different from those in
      unique phrases. During the annotation process, two coders
                                                                            Table 3. Clinical Medicine contains the most associated
      maintained communication with the professional in the
                                                                            knowledge phrases, followed by Social Sciences, General and
      eHealth field to reach an agreement.
                                                                            Psychiatry/Psychology. The ranks of disciplines by the frequency
                                                                            of associated knowledge phrases are consistent with the ranks by
Our final framework contains seven categories, including research           the frequency of in-text citations.
subject, theory, research methodology, technology, entity, data
and others, which are defined in detail in Table 1.
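
The agreement check in the pre-annotation step can be illustrated with Cohen's kappa for two coders. The paper reports only "kappa statistics", so the choice of Cohen's kappa is an assumption; the function name and the toy labels below are hypothetical and are not the annotation data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


coder_a = ["research subject", "research subject", "others", "entity", "technology", "others"]
coder_b = ["research subject", "others", "others", "entity", "technology", "research subject"]
print(round(cohen_kappa(coder_a, coder_b), 3))
# prints 0.538
```

In this toy case the coders disagree only between research subject and others, the same confusion reported for the pre-annotation round.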
                                                                            TABLE 3. Distribution of references over source disciplines

3     Results                                                                                              Unique        CountOne     In-text
                                                                            Rank    Discipline
                                                                                                           references    citations    citations
3.1    Dataset Description
                                                                            1       Clinical Medicine      24802         47968        66673
We obtained 3,221 papers from the eHealth field with the
                                                                                    Social     Sciences,
publication year between 1999 and 2018. Some characteristics of             2                              12812         22530        30196
                                                                                    General
our dataset for analysis are given in Table 2. In total, 115,456
                                                                                    Psychiatry         /
citances and 98,685 reference records (55,744 distinct articles)            3                              9371          15915        21606
                                                                                    Psychology
with discipline information were extracted from our corpus. The
                                                                                    Neuroscience      &
98,685 reference records were cited a total of 134,516 times (i.e.,         4                              1914          2414         3152
                                                                                    Behavior
in-text references) in all citances. Roughly 90% of the reference
records have abstracts.                                                     5       Multidisciplinary      1259          2052         2754
                                                                            6       Computer Science       1153          1660         2278
TABLE 2. Characteristics of our dataset for analysis                        7       Immunology             839           1185         1464
                                                                                    Economics      &
Characteristics                                     Statistics              8                              693           949          1222
                                                                                    Business
Citing papers                                       3,221                           Biology        &
                                                                            9                              632           1041         1398
                                                                                    Biochemistry
Citances                                            115,456                         Pharmacology   &
                                                                            10                             567           710          963
Reference records                                   98,685                          Toxicology
Reference records with abstract                     89,649                          Agricultural
                                                                            11                             546           839          1145
                                                                                    Sciences
Unique reference articles                           55,744
                                                                            12      Engineering            303           357          441
In-text references                                  134,516                         Molecular Biology
                                                                            13                             254           323          425
In-text references with abstract                    123,206                         & Genetics
                                                                            14      Mathematics            181           271          312
                                                                                    Environment     /
                                                                            15                             181           216          249
3.2    Source Disciplines                                                           Ecology
To address our research question, we analyzed the distribution of           16      Chemistry              80            91           94
references over disciplines. Table 3 shows the number of unique             17      Microbiology           51            53           44
cited articles, CountOne citations, and in-text citations for the 22                Plant & Animal
disciplines. The CountOne citations were obtained by counting               18                             46            47           38
                                                                                    Science
each reference only once in a citing paper, whereas the in-text
                                                                            19      Physics                27            30           36
citations count all the mentions of references in the paper[14]. The
disciplines are ranked by the number of unique references. It is            20      Geosciences            26            26           15
observed that the ranks of the disciplines by CountOne citations            21      Materials Science      5             6            9
are the same as the ranks by in-text citations. In the following
analysis, we choose the top 10 disciplines with the most unique             22      Space Science          2             2            2
references, which cover 96.95% of all unique references.
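
The three counts reported in Table 3 differ only in how repeated mentions are handled. The following is a minimal sketch of that difference, assuming each citing paper is given as the list of reference identifiers behind its in-text mentions; the function name, identifiers and toy data are hypothetical.

```python
from collections import Counter


def count_by_discipline(papers, discipline_of):
    """Return (unique references, CountOne citations, in-text citations) per discipline.

    `papers` maps a citing paper to the reference ids of its in-text mentions.
    `discipline_of` maps a reference id to its source discipline.
    """
    unique_refs = {}
    count_one, in_text = Counter(), Counter()
    for mentions in papers.values():
        # In-text citations: every mention counts.
        for ref in mentions:
            in_text[discipline_of[ref]] += 1
        # CountOne citations: each reference counts once per citing paper.
        for ref in set(mentions):
            count_one[discipline_of[ref]] += 1
            unique_refs.setdefault(discipline_of[ref], set()).add(ref)
    unique = Counter({d: len(refs) for d, refs in unique_refs.items()})
    return unique, count_one, in_text


papers = {"p1": ["r1", "r1", "r2"], "p2": ["r1", "r3", "r3", "r3"]}
discipline_of = {"r1": "Clinical Medicine", "r2": "Computer Science", "r3": "Clinical Medicine"}
unique, count_one, in_text = count_by_discipline(papers, discipline_of)
print(unique["Clinical Medicine"], count_one["Clinical Medicine"], in_text["Clinical Medicine"])
# prints 2 3 6
```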

3.3    Distribution of Associated                    Knowledge              In addition, we calculated the knowledge density in the flow (i.e.,
      Phrases over Disciplines                                              the average number of phrases per citation) by dividing the
                                                                            frequency of phrases by the number of citations for each
In total, 215,138 associated knowledge phrases were extracted               discipline. On average, every citation from the disciplines carried
between the citances and the 123,206 in-text references with                more than one associated knowledge phrase. The scores of
abstracts. Here, we only analyze 211,454 knowledge phrases                  knowledge density are slightly different between the 10





disciplines. Pharmacology & Toxicology exceeds other source                  other disciplines. Computer Science has a higher proportion of
disciplines, with the most phrases per citation, while Computer              technology phrases compared with other disciplines. This could
Science contains the fewest phrases per citation.                            be explained by the fact that Computer Science provides eHealth
                                                                             research with substantial technical support, and many eHealth
TABLE 4. The frequency of associated knowledge phrases                       research problems are related to Computer Science.

Disciplines                  Knowledge phrases   In-text citations   Knowledge density
Clinical Medicine            113,424             61,385              1.848
Social Sciences, General     46,532              28,008              1.661
Psychiatry / Psychology      31,765              19,446              1.633
Neuroscience & Behavior      5,365               3,014               1.780
Multidisciplinary            4,470               2,561               1.745
Computer Science             2,750               1,979               1.390
Immunology                   2,434               1,352               1.800
Biology & Biochemistry       1,905               1,301               1.464
Pharmacology & Toxicology    1,620               876                 1.849
Economics & Business         1,189               855                 1.391
                                                                             Figure 1: Frequency distribution of knowledge categories.
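
The knowledge density column of Table 4 can be reproduced by dividing the number of associated knowledge phrases by the number of in-text citations for each discipline. The short sketch below uses the counts reported in Table 4; the variable names are illustrative.

```python
# (knowledge phrases, in-text citations) per discipline, as reported in Table 4.
table4 = {
    "Clinical Medicine": (113424, 61385),
    "Social Sciences, General": (46532, 28008),
    "Psychiatry / Psychology": (31765, 19446),
    "Neuroscience & Behavior": (5365, 3014),
    "Multidisciplinary": (4470, 2561),
    "Computer Science": (2750, 1979),
    "Immunology": (2434, 1352),
    "Biology & Biochemistry": (1905, 1301),
    "Pharmacology & Toxicology": (1620, 876),
    "Economics & Business": (1189, 855),
}

# Knowledge density: average number of associated phrases per in-text citation.
density = {name: phrases / citations for name, (phrases, citations) in table4.items()}

for name, value in density.items():
    print(f"{name:28s}{value:.3f}")

print("highest:", max(density, key=density.get))
print("lowest:", min(density, key=density.get))
```

The phrase counts in the table sum to 211,454, the number of phrases associated with the top 10 disciplines.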

3.4 Knowledge Category Distribution among
Source Disciplines
Figure 1 shows the number of associated knowledge phrases in each
category according to the annotation result. Research subject is
the largest category, accounting for 43.8% of the phrases, which
suggests that authors usually cite references related to their
research subject. Notably, others is the second largest category.
Such phrases often involve specific authors’ names, geolocations,
specific projects, funding and some meaningless phrases, and they
are not subdivided in our classification framework. In addition,
the categories of entity and technology contain more phrases than
research methodology. This may be because our corpus is
medical-related: research in this field requires many medical
instruments, and the entities it targets often vary with the                 Figure 2: Frequency distribution of knowledge categories over
research subject (e.g., different diseases).                                 disciplines.
Figure 2 presents the number of associated knowledge phrases in
different categories over the disciplines. The knowledge category            4   Discussion & Conclusion
distribution over different disciplines is significantly different           This study investigates the knowledge flow towards the
(Pearson Chi Square test, p-value < 0.001). The top 3 disciplines,           interdisciplinary field of eHealth from the perspective of
Clinical Medicine, Social Sciences, General, and Psychiatry/                 knowledge content. We extracted the knowledge phrases shared
Psychology, supply the most phrases in all categories.                       between the citances in the field and the references to represent
For each discipline, most of the associated knowledge phrases are            knowledge content spread from source disciplines to the field. A
research subjects.                                                           classification framework was applied to annotate the identified
                                                                             knowledge phrases to explore the knowledge types of the phrases.
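
The Pearson chi-square test mentioned in Section 3.4 compares the observed discipline-by-category counts with the counts expected if the category distribution were the same in every discipline. The following is a minimal sketch of the statistic and its degrees of freedom only; the toy counts are invented for illustration and are not the paper's data, and in practice a library routine such as SciPy's `chi2_contingency` would normally be used to obtain the p-value.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows (e.g. disciplines) of counts per column
    (e.g. knowledge categories). Every row and column total must be non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Toy 2x2 table: two disciplines by two categories (illustrative counts).
statistic, dof = chi_square([[30, 10], [20, 40]])
print(round(statistic, 3), dof)
# prints 16.667 1
```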
In general, the distribution of associated knowledge phrases in              The interdisciplinary features of eHealth are shown by analyzing
each discipline over the categories is similar to the overall                the associated knowledge phrases.
distribution in the entire dataset. However, a few exceptions are            The findings of this study could provide a few insightful
also observed. The proportion of theory phrases over all the                 implications on interdisciplinary knowledge integration. The
phrases in Economics & Business is much higher than that in                  result shows that the ranks of disciplines by the frequency of








associated phrases are consistent with the ranks by the frequency                             [6] Hassan S U, Safder I, Akram A, and Kamiran F. 2018. A novel machine-
                                                                                                   learning approach to measuring scientific knowledge flows using citation
of in-text citations. It means that to measure interdisciplinary                                   context analysis. Scientometrics 116(2), 973-996.
knowledge flow, an indicator based on the frequency of shared                                 [7] Thelwall M. 2019. Should citations be counted separately from each originating
phrases may produce similar results to an indicator using the                                      section? Journal of Informetrics 13(2), 658-678.
frequency of references, in that the in-text references from                                  [8] Pagliari C, Sloan D, Gregor P, Sullivan F, Detmer D, Kahan J P, ... and
                                                                                                   MacGillivray S. 2005. What is eHealth (4): a scoping exercise to map the field.
different disciplines often carry similar amounts of phrases (Table                                Journal of medical Internet research 7(1), e9.
4). Associated phrases can indicate the spread content, which may                             [9] Small H. 1978. Cited Documents as Concept Symbols. Social Studies of
be useful to generate knowledge map of interdisciplinary                                           Science 8(3), 327-340.
                                                                                              [10] Elkiss A, Shen S, Fader A, Erkan G, States D, and Radev D. 2008. Blind men
knowledge integration. However, they do not directly differentiate                                 and elephants: What do citation summaries tell us about a research article?.
citations, thus, it is not enough to only consider phrase frequencies                              Journal of the American Society for Information Science and Technology 59(1),
to measure interdisciplinary knowledge integration at the aspect of                                51-62.
content.                                                                                      [11] Wang, Y. and Zhang, C. 2018. What type of domain knowledge is cited by
                                                                                                   articles with high interdisciplinary degree? In Proceedings of the Association
The frequency distribution of knowledge phrases over the                                           for Information Science and Technology 55, 1, 919–921.
categories is heavily skewed. Except others, the most in-text                                 [12] Gupta, S. and Manning, C. 2011. Analyzing the dynamics of research by
references carry the phrases of research subject, followed by                                      extracting key aspects of scientific papers. In Proceedings of 5th International
                                                                                                   Joint Conference on Natural Language Processing, 1-9.
entity. The results show the distribution of different types of                               [13] Radoulov, R. 2008. Exploring automatic citation classification (master’s thesis).
knowledge from the source disciplines. The types of knowledge                                      Waterloo, Ontario, Canada: The University of Waterloo.
phrases can be used as an important feature to differentiate                                  [14] Ding Y, Liu X, Guo C, and Cronin B. 2013. The distribution of references
references, for instance, the motivations of citations. The                                        across texts: Some implications for citation analysis. Journal of Informetrics
                                                                                                   7(3), 583-592.
categories of knowledge will be helpful to understand the roles of
source disciplines in the knowledge integration of an
interdisciplinary field.
A few limitations can be identified as well. To obtain full text of
research articles, we only chose the two open access journals to
represent the field of eHealth, which may not cover all the articles
of this field. The problem of data deficiency is common in full-
text based domain analysis. To identify the knowledge transferred
from source disciplines to the interdisciplinary field, shared
phrases are extracted by using simple text matching. However,
synonyms are often used in citing others’ work, thus the coverage
of the shared knowledge may be in short.
We also identified some directions of future research. We
manually annotated the categories of associated phrases. To
support the analysis on large scale datasets, automating the
classification of spread knowledge is on great demand, which is a
challenging task of our interest. This motivates us to design a
more general classification framework to analyze the content of
knowledge spread between disciplines. In addition, recent
machine learning techniques will be applied to this task in our
future study.

ACKNOWLEDGMENTS
This study was funded by the National Natural Science
Foundation of China (Grant No. 71804135) and Ministry of
Education Humanities and Social Sciences project in China (Grant
No.19YJC870018). We also thank Jing Tang for helping us with
the data processing.

REFERENCES
[1] Porter A L, Cohen A S, Roessner J D, et al. 2007. Measuring researcher
    interdisciplinarity. Scientometrics 72(1),117-147.
[2] Yan E. 2016. Disciplinary knowledge production and diffusion in science.
    Journal of the Association for Information Science and Technology 67(9),
    2223-2245.
[3] Ding Y, Zhang G, Chambers T, Song M, Wang X, and Zhai C. 2014. Content‐
    based citation analysis: The next generation of citation analysis. Journal of the
    association for information science and technology 65(9), 1820-1833.
[4] Zhang G, Ding Y, and Milojević S. 2013. Citation Content Analysis (CCA): A
    Framework for Syntactic and Semantic Analysis of Citation Content. Journal of
    the Association for Information Science and Technology 64(7), 1490-1503.
[5] Zhu X, Turney P, Lemire D, and Vellino A. 2015. Measuring academic
    influence: Not all citations are equal. Journal of the Association for Information
    Science and Technology 66(2), 408–427.



