=Paper=
{{Paper
|id=Vol-2658/paper4
|storemode=property
|title=Investigating Interdisciplinary Knowledge Flow from the Content Perspective of Citances
|pdfUrl=https://ceur-ws.org/Vol-2658/paper4.pdf
|volume=Vol-2658
|authors=Jin Mao,Shiyun Wang,Xianli Shang
|dblpUrl=https://dblp.org/rec/conf/jcdl/MaoWS20
}}
==Investigating Interdisciplinary Knowledge Flow from the Content Perspective of Citances==
EEKE 2020 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

* Jin Mao, School of Information Management, Wuhan University, Wuhan, Hubei, China, maojin@whu.edu.cn
* Shiyun Wang, School of Information Management, Wuhan University, Wuhan, Hubei, China, wangsy2@whu.edu.cn
* Xianli Shang†, Business School, Xinyang Agriculture and Forestry Univ, Xinyang, Henan, China, 47655282@qq.com

† Corresponding author.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Interdisciplinary research plays an important role in modern science. In recent years, many studies have measured interdisciplinary knowledge flow based on the frequency of citations. However, this approach does not consider the content of the knowledge carried by the citations. In this study, we investigate the content of knowledge flow towards an interdisciplinary field by analyzing the citation sentences (i.e., citances) in the articles of the field. An emerging field, eHealth, is chosen as the case study. The associated knowledge phrases shared between citances and the references of the field are identified and categorized to analyze the content and categories of knowledge spread from the source disciplines to the field. The results show that the ranks of disciplines by the frequency of associated phrases are consistent with the ranks by the frequency of in-text citations. The distribution of associated phrases over categories and disciplines is also analyzed. Apart from the others category, phrases in the research subject category are the most frequent, followed by entity. This study contributes to the understanding of the content characteristics of interdisciplinary knowledge integration.

===CCS Concepts===
* Theory of computation ~ Semantics and reasoning ~ Program semantics ~ Categorical semantics
* Information systems ~ Information systems applications ~ Data mining
* Information systems ~ Information retrieval ~ Retrieval tasks and goals ~ Information extraction

===Keywords===
Interdisciplinary research, Content classification, eHealth, In-text reference, Knowledge integration

===1 Introduction===
Interdisciplinary research has become an important research paradigm, and many recent significant breakthroughs in science are the fruits of interdisciplinary research. One fundamental feature of interdisciplinary research is the integration of knowledge from multiple disciplines outside the field [1]. Methods, theories, tools and concepts from different disciplines are often integrated to solve complex interdisciplinary research problems. To understand the characteristics of interdisciplinary knowledge integration, citation analysis has often been used to examine knowledge flow among disciplines [2]. Conventionally, the knowledge flow to a field is simply measured by the number of references cited by the papers in the field. The differing importance, motivations and many other aspects of the citations in a paper are ignored.

Recent studies have shifted to investigating interdisciplinary knowledge flow from a finer-grained perspective by looking into the content and contexts of citations. Citation contexts, which have become easier to obtain in recent years, embed the syntactic (e.g., the section location and rhetorical style) and semantic (e.g., the meaning of citation content) information of citations [3]. Citation contexts have been used to differentiate the functions [4-5], importance [6] and knowledge contributions [7] of different citations. The rich information in citation contexts makes it possible to analyze what knowledge is integrated into an interdisciplinary field.

In this study, we explore the content of knowledge integrated into an interdisciplinary field, eHealth, by analyzing citances. eHealth is an emerging field that refers to all aspects of the intersection of health care and the Internet [8]. A citance, which provides the context of a citation, is defined as a sentence that contains in-text reference information. Our research questions are what knowledge is integrated from the source disciplines into eHealth, and what types of knowledge these are. We design an approach to analyze the content and categories of the knowledge shared between citances and their references. This study contributes to understanding the content characteristics of interdisciplinary knowledge integration.

===2 Methodology===
====2.1 Data Collection====
Two high-impact eHealth journals, Journal of Medical Internet Research and JMIR mHealth and uHealth, were selected as our data sources. All 3,416 articles with XML files published from 1999 to 2018 were collected. We focused only on the 3,221 articles of the types Original papers, Reviews and Viewpoints. The metadata of references, including title, citation type, journal name, DOI, PubMed ID, and publication year, were parsed from the XML files. Sentences were extracted by using punctuation marks (periods, question marks, etc.) as sentence boundaries, and citances with in-text references were then identified. In total, 115,456 citances and 140,572 reference records were obtained.

To complete the abstracts of the references, the reference records were fetched by searching PubMed by PubMed ID or Web of Science (WoS) by DOI. In total, the abstracts of 89,649 reference records were collected.

====2.2 Source Discipline Identification====
To explore the sources of input knowledge, the references were categorized into the 22 disciplines of Essential Science Indicators (ESI). We used the 2018 version of the ESI journal list, which covers 11,727 journals with their full titles, abbreviated titles and the disciplines they belong to.

We designed a pipeline to determine the ESI disciplines of the references. First, 7,393 distinct journal titles were obtained from the 104,888 reference records with the citation type 'journal' and with a DOI or PubMed ID. We manually completed the full titles of abbreviated journal titles that could not be found in the ESI journal list but had more than 2 references. Next, we identified the disciplines of the references by matching their journal titles with the journal titles in ESI. However, 8,393 reference records still lacked ESI discipline information. Since the journal coverage of ESI is not as broad as that of the WoS journal list, the WoS subject categories were then used to infer the ESI disciplines of the journal titles that were not matched directly. We designed a method to map the WoS subject categories to the ESI disciplines. We calculated the likelihood of a WoS subject category belonging to an ESI discipline from the journals in that category whose ESI disciplines are known. The ESI discipline with the highest probability was then taken as the ESI discipline of the WoS category. If a journal has multiple WoS subject categories, we likewise chose the ESI discipline that has the highest probability across all of its WoS categories.

Finally, approximately 94.09% of the journal reference records (98,685) obtained discipline information.

====2.3 Extracting and Classifying Associated Knowledge Phrases====
Citation contexts contain information about the cited articles that is relevant to the citing papers [9-10]. We contend that the words occurring in both a citation context and the corresponding cited paper can reflect, to a certain extent, the explicit knowledge association between the two. In this study, we used the title and abstract to represent a cited paper (i.e., a reference) because of the difficulty of obtaining full text. We extracted noun phrases that carry meaningful concepts from the citances as well as from the titles and abstracts of the references by using spaCy, an open-source Python natural language processing toolkit. Noun phrases consisting of a single character or containing wildcard characters (e.g., “#”, “*”, “@”, etc.) were removed, as were those starting or ending with a number. Stop words listed in the NLTK package were also eliminated. Acronyms were identified and expanded into their full forms by using the scispaCy package. We used both the acronyms and their full forms in the matching process, but retained only the raw forms of the noun phrases extracted from the citances. Thus, an associated knowledge phrase is defined as a noun phrase appearing in both a citance and its reference, which can be regarded as the knowledge transferred from the reference to the citing paper.

To analyze the types of knowledge that flow to the eHealth field, we designed a classification framework of associated knowledge phrases based on previous studies [11-13]. Two graduate students familiar with the field of eHealth were recruited to annotate the categories of the associated knowledge phrases by following these steps:

# Initializing the knowledge classification framework. One author constructed a preliminary classification schema after reviewing the literature. The author then randomly selected 100 knowledge phrases for trial annotation, organized the annotation details, and wrote an annotation specification document that provides a detailed definition of each category with a few exemplar concepts.
# Pre-annotation. Pre-annotation training was carried out for the two coders. Subsequently, the two coders independently annotated the same 500 randomly selected knowledge phrases. After labeling, we calculated the kappa statistic to assess the agreement between the two coders. The kappa was 0.65, which was not as good as expected. The two coders therefore discussed the ambiguous cases with a professional in the eHealth field. We found that some phrases may not make sense when they appear alone but are meaningful in the given context; as a result, many phrases were assigned to the research subject category by one coder and to the others category by the other. After the discussion, the two coders reached a consensus.
# Formal annotation. The two coders annotated all 24,132 unique phrases. During the annotation process, the two coders stayed in communication with the professional in the eHealth field to reach an agreement.

Our final framework contains seven categories: research subject, theory, research methodology, technology, entity, data and others, which are defined in detail in Table 1.

{| class="wikitable"
|+ TABLE 1. The classification framework of associated knowledge phrases.
! Category !! Description !! Exemplar phrases
|-
| Research subject || Subject terms related to health research problem || depression, diabetes, health information
|-
| Theory || Theory related phrases || TAM, social cognitive theory, transtheoretical model
|-
| Research methodology || Methodology used in research || systematic review, analysis, meta analysis, randomize control trial
|-
| Technology || Technique, device and system used in research || mobile phone, web, smartphone, app
|-
| Entity || Human-related research object || patient, woman, child, adolescent
|-
| Data || Phrases related to dataset, data source and data material || twitter, qualitative datum, clinical datum
|-
| Others || Other phrases that cannot be included in the above categories || study, use, result, outcome, number, Canada, project, USA
|}

===3 Results===
====3.1 Dataset Description====
We obtained 3,221 papers from the eHealth field published between 1999 and 2018. Some characteristics of our dataset are given in Table 2. In total, 115,456 citances and 98,685 reference records (55,744 distinct articles) with discipline information were extracted from our corpus. The 98,685 reference records were cited a total of 134,516 times (i.e., in-text references) in all citances. Roughly 90% of the reference records have abstracts.

{| class="wikitable"
|+ TABLE 2. Characteristics of our dataset for analysis
! Characteristics !! Statistics
|-
| Citing papers || 3,221
|-
| Citances || 115,456
|-
| Reference records || 98,685
|-
| Reference records with abstract || 89,649
|-
| Unique reference articles || 55,744
|-
| In-text references || 134,516
|-
| In-text references with abstract || 123,206
|}

====3.2 Source Disciplines====
To address our research question, we analyzed the distribution of references over disciplines. Table 3 shows the number of unique cited articles, CountOne citations, and in-text citations for the 22 disciplines. The CountOne citations were obtained by counting each reference only once in a citing paper, whereas the in-text citations count all the mentions of references in the paper [14]. The disciplines are ranked by the number of unique references. The ranks of the disciplines by CountOne citations are the same as the ranks by in-text citations. In the following analysis, we choose the top 10 disciplines with the most unique references, which cover 96.95% of all unique references.

{| class="wikitable"
|+ TABLE 3. Distribution of references over source disciplines
! Rank !! Discipline !! Unique references !! CountOne citations !! In-text citations
|-
| 1 || Clinical Medicine || 24,802 || 47,968 || 66,673
|-
| 2 || Social Sciences, General || 12,812 || 22,530 || 30,196
|-
| 3 || Psychiatry / Psychology || 9,371 || 15,915 || 21,606
|-
| 4 || Neuroscience & Behavior || 1,914 || 2,414 || 3,152
|-
| 5 || Multidisciplinary || 1,259 || 2,052 || 2,754
|-
| 6 || Computer Science || 1,153 || 1,660 || 2,278
|-
| 7 || Immunology || 839 || 1,185 || 1,464
|-
| 8 || Economics & Business || 693 || 949 || 1,222
|-
| 9 || Biology & Biochemistry || 632 || 1,041 || 1,398
|-
| 10 || Pharmacology & Toxicology || 567 || 710 || 963
|-
| 11 || Agricultural Sciences || 546 || 839 || 1,145
|-
| 12 || Engineering || 303 || 357 || 441
|-
| 13 || Molecular Biology & Genetics || 254 || 323 || 425
|-
| 14 || Mathematics || 181 || 271 || 312
|-
| 15 || Environment / Ecology || 181 || 216 || 249
|-
| 16 || Chemistry || 80 || 91 || 94
|-
| 17 || Microbiology || 51 || 53 || 44
|-
| 18 || Plant & Animal Science || 46 || 47 || 38
|-
| 19 || Physics || 27 || 30 || 36
|-
| 20 || Geosciences || 26 || 26 || 15
|-
| 21 || Materials Science || 5 || 6 || 9
|-
| 22 || Space Science || 2 || 2 || 2
|}

====3.3 Distribution of Associated Knowledge Phrases over Disciplines====
In total, 215,138 associated knowledge phrases were extracted between the citances and the 123,206 in-text references with abstracts. Here, we analyze only the 211,454 knowledge phrases associated with the top 10 disciplines (98.29% of all). Table 4 presents the frequency of associated knowledge phrases by discipline. Note that only references with abstracts were used to extract associated knowledge phrases, so the numbers of in-text citations in Table 4 differ from those in Table 3. Clinical Medicine contributes the most associated knowledge phrases, followed by Social Sciences, General and Psychiatry / Psychology. The ranks of disciplines by the frequency of associated knowledge phrases agree with the ranks by the frequency of in-text citations.

In addition, we calculated the knowledge density of the flow (i.e., the average number of phrases per citation) by dividing the frequency of phrases by the number of citations for each discipline. On average, every citation from these disciplines carried more than one associated knowledge phrase. The knowledge density scores differ slightly among the 10 disciplines. Pharmacology & Toxicology has the most phrases per citation, while Computer Science has the fewest.

{| class="wikitable"
|+ TABLE 4. The frequency of associated knowledge phrases
! Disciplines !! Knowledge phrases !! In-text citations !! Knowledge density
|-
| Clinical Medicine || 113,424 || 61,385 || 1.848
|-
| Social Sciences, General || 46,532 || 28,008 || 1.661
|-
| Psychiatry / Psychology || 31,765 || 19,446 || 1.633
|-
| Neuroscience & Behavior || 5,365 || 3,014 || 1.780
|-
| Multidisciplinary || 4,470 || 2,561 || 1.745
|-
| Computer Science || 2,750 || 1,979 || 1.390
|-
| Immunology || 2,434 || 1,352 || 1.800
|-
| Biology & Biochemistry || 1,905 || 1,301 || 1.464
|-
| Pharmacology & Toxicology || 1,620 || 876 || 1.849
|-
| Economics & Business || 1,189 || 855 || 1.391
|}

====3.4 Knowledge Category Distribution among Source Disciplines====
Figure 1 shows the number of associated knowledge phrases in each category according to the annotation result. The research subject category contains the most phrases, accounting for 43.8%. This shows that authors usually cite references related to their research subject. Notably, the others category contains the second most phrases. Such phrases often involve specific authors’ names, geolocations, specific projects, funding and some meaningless phrases. These phrases are not subdivided in our classification framework. In addition, the categories of entity and technology have more phrases than research methodology. This may be because the field of our corpus is medical-related: its research requires the use of many medical instruments, and the research entities it targets often vary with the research subjects (e.g., different diseases).

Figure 1: Frequency distribution of knowledge categories.

Figure 2 presents the number of associated knowledge phrases in the different categories over the disciplines. The knowledge category distribution differs significantly across disciplines (Pearson Chi Square test, p-value < 0.001). The top 3 disciplines, Clinical Medicine, Social Sciences, General, and Psychiatry / Psychology, supply the largest numbers of phrases in all categories. For each discipline, most of the associated knowledge phrases are research subjects.

Figure 2: Frequency distribution of knowledge categories over disciplines.

In general, the distribution of associated knowledge phrases over the categories in each discipline is similar to the overall distribution in the entire dataset. However, a few exceptions are observed. The proportion of theory phrases among all the phrases in Economics & Business is much higher than that in other disciplines. Computer Science has a higher proportion of technology phrases than other disciplines. This could be explained by the fact that Computer Science provides eHealth research with a lot of technical support, and many eHealth research problems are related to Computer Science.

===4 Discussion & Conclusion===
This study investigates the knowledge flow towards the interdisciplinary field of eHealth from the perspective of knowledge content. We extracted the knowledge phrases shared between the citances in the field and their references to represent the knowledge content spread from the source disciplines to the field. A classification framework was applied to annotate the identified knowledge phrases and explore their knowledge types. The interdisciplinary features of eHealth are shown by analyzing the associated knowledge phrases.

The findings of this study provide a few implications for interdisciplinary knowledge integration. The results show that the ranks of disciplines by the frequency of associated phrases are consistent with the ranks by the frequency of in-text citations. This means that, to measure interdisciplinary knowledge flow, an indicator based on the frequency of shared phrases may produce results similar to those of an indicator based on the frequency of references, in that the in-text references from different disciplines often carry similar amounts of phrases (Table 4). Associated phrases can indicate the content that is spread, which may be useful for generating knowledge maps of interdisciplinary knowledge integration. However, they do not directly differentiate citations, so considering phrase frequencies alone is not enough to measure interdisciplinary knowledge integration in terms of content.

The frequency distribution of knowledge phrases over the categories is heavily skewed. Apart from others, most in-text references carry phrases of the research subject category, followed by entity. The results show the distribution of different types of knowledge from the source disciplines. The types of knowledge phrases can be used as an important feature to differentiate references, for instance, by the motivations of citations. The categories of knowledge will help in understanding the roles of source disciplines in the knowledge integration of an interdisciplinary field.

A few limitations can be identified as well. To obtain the full text of research articles, we chose only two open access journals to represent the field of eHealth, which may not cover all the articles of this field. The problem of data deficiency is common in full-text based domain analysis. To identify the knowledge transferred from source disciplines to the interdisciplinary field, shared phrases are extracted by using simple text matching. However, synonyms are often used when citing others’ work, so the coverage of the shared knowledge may fall short.

We also identified some directions for future research. We manually annotated the categories of associated phrases. To support analysis of large-scale datasets, automating the classification of spread knowledge is in great demand, and it is a challenging task of interest to us. This motivates us to design a more general classification framework for analyzing the content of knowledge spread between disciplines. In addition, recent machine learning techniques will be applied to this task in our future study.

===Acknowledgments===
This study was funded by the National Natural Science Foundation of China (Grant No. 71804135) and the Ministry of Education Humanities and Social Sciences project in China (Grant No. 19YJC870018). We also thank Jing Tang for helping us with the data processing.

===References===
* [1] Porter A L, Cohen A S, Roessner J D, et al. 2007. Measuring researcher interdisciplinarity. Scientometrics 72(1), 117-147.
* [2] Yan E. 2016. Disciplinary knowledge production and diffusion in science. Journal of the Association for Information Science and Technology 67(9), 2223-2245.
* [3] Ding Y, Zhang G, Chambers T, Song M, Wang X, and Zhai C. 2014. Content-based citation analysis: The next generation of citation analysis. Journal of the Association for Information Science and Technology 65(9), 1820-1833.
* [4] Zhang G, Ding Y, and Milojević S. 2013. Citation Content Analysis (CCA): A Framework for Syntactic and Semantic Analysis of Citation Content. Journal of the Association for Information Science and Technology 64(7), 1490-1503.
* [5] Zhu X, Turney P, Lemire D, and Vellino A. 2015. Measuring academic influence: Not all citations are equal. Journal of the Association for Information Science and Technology 66(2), 408-427.
* [6] Hassan S U, Safder I, Akram A, and Kamiran F. 2018. A novel machine-learning approach to measuring scientific knowledge flows using citation context analysis. Scientometrics 116(2), 973-996.
* [7] Thelwall M. 2019. Should citations be counted separately from each originating section? Journal of Informetrics 13(2), 658-678.
* [8] Pagliari C, Sloan D, Gregor P, Sullivan F, Detmer D, Kahan J P, ... and MacGillivray S. 2005. What is eHealth (4): a scoping exercise to map the field. Journal of Medical Internet Research 7(1), e9.
* [9] Small H. 1978. Cited Documents as Concept Symbols. Social Studies of Science 8(3), 327-340.
* [10] Elkiss A, Shen S, Fader A, Erkan G, States D, and Radev D. 2008. Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology 59(1), 51-62.
* [11] Wang Y and Zhang C. 2018. What type of domain knowledge is cited by articles with high interdisciplinary degree? In Proceedings of the Association for Information Science and Technology 55(1), 919-921.
* [12] Gupta S and Manning C. 2011. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of 5th International Joint Conference on Natural Language Processing, 1-9.
* [13] Radoulov R. 2008. Exploring automatic citation classification (master’s thesis). Waterloo, Ontario, Canada: The University of Waterloo.
* [14] Ding Y, Liu X, Guo C, and Cronin B. 2013. The distribution of references across texts: Some implications for citation analysis. Journal of Informetrics 7(3), 583-592.
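The mapping from WoS subject categories to ESI disciplines described in Section 2.2 could be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function names and the toy journal data are hypothetical, the likelihood is assumed to be a simple share of journals, and for a journal with several WoS categories the sketch assumes that the single highest category-level probability decides the discipline.

```python
from collections import Counter, defaultdict


def category_to_discipline_probs(known_journals):
    """Estimate P(ESI discipline | WoS category) from journals whose ESI
    discipline is already known.

    known_journals: iterable of (list_of_wos_categories, esi_discipline).
    The share-of-journals estimate is an assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for categories, discipline in known_journals:
        for category in categories:
            counts[category][discipline] += 1
    probs = {}
    for category, disc_counts in counts.items():
        total = sum(disc_counts.values())
        probs[category] = {d: n / total for d, n in disc_counts.items()}
    return probs


def infer_discipline(categories, probs):
    """Return the discipline with the highest probability over all of the
    journal's WoS categories, or None if no category is known."""
    best, best_p = None, 0.0
    for category in categories:
        for discipline, p in probs.get(category, {}).items():
            if p > best_p:
                best, best_p = discipline, p
    return best


# Hypothetical toy data: journals matched directly against the ESI list.
known = [
    (["Medical Informatics"], "Clinical Medicine"),
    (["Medical Informatics", "Computer Science, Information Systems"], "Computer Science"),
    (["Medical Informatics"], "Clinical Medicine"),
    (["Computer Science, Information Systems"], "Computer Science"),
]
probs = category_to_discipline_probs(known)
single = infer_discipline(["Medical Informatics"], probs)
multi = infer_discipline(
    ["Medical Informatics", "Computer Science, Information Systems"], probs
)
```

Other ways of combining several categories (for example, summing the probabilities per discipline) would also be consistent with the description in the text.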
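The filtering and matching step of Section 2.3 can be illustrated with a small sketch that starts from noun phrases that have already been extracted (the spaCy and scispaCy steps are not reproduced here). The stop-word set is a tiny stand-in for the NLTK list, matching is assumed to be exact and case-insensitive, stop-word removal is assumed to apply to whole phrases, and all names are hypothetical.

```python
WILDCARDS = set("#*@")
# Tiny stand-in for the NLTK stop-word list (assumption of this sketch).
STOP_WORDS = {"the", "a", "an", "this", "that", "it", "we"}


def keep_phrase(phrase):
    """Apply the filters described in Section 2.3 to one noun phrase."""
    p = phrase.strip().lower()
    if len(p) <= 1:                       # single character (or empty)
        return False
    if any(ch in WILDCARDS for ch in p):  # contains a wildcard character
        return False
    if p[0].isdigit() or p[-1].isdigit():  # starts or ends with a number
        return False
    if p in STOP_WORDS:                   # the whole phrase is a stop word
        return False
    return True


def associated_phrases(citance_phrases, reference_phrases, acronyms=None):
    """Return citance phrases (raw form) that also occur in the reference.

    acronyms maps a lower-cased acronym to its full form; both forms are
    tried when matching, but only the raw citance form is kept.
    """
    acronyms = acronyms or {}
    ref = {p.strip().lower() for p in reference_phrases if keep_phrase(p)}
    matched = []
    for raw in citance_phrases:
        if not keep_phrase(raw):
            continue
        key = raw.strip().lower()
        forms = {key}
        if key in acronyms:
            forms.add(acronyms[key].lower())
        if forms & ref and raw not in matched:
            matched.append(raw)
    return matched


citance = ["TAM", "mobile phone", "3 trials", "it", "depression"]
reference = ["technology acceptance model", "Mobile phone", "depression", "x"]
shared = associated_phrases(
    citance, reference, acronyms={"tam": "technology acceptance model"}
)
```

Because matching is exact, synonyms are missed, which is the limitation noted in the discussion.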
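The inter-coder agreement check in the pre-annotation step of Section 2.3 reports a kappa statistic without naming the variant. Assuming it is Cohen's kappa for two coders, the computation could look like the sketch below; the labels are invented for illustration and are not the study's annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for four phrases, using category names from Table 1.
coder_1 = ["research subject", "research subject", "entity", "others"]
coder_2 = ["research subject", "others", "entity", "others"]
kappa = cohen_kappa(coder_1, coder_2)
```

In this toy case the coders agree on three of four phrases, and the disagreement is between research subject and others, the confusion the text describes.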
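As a worked check of the knowledge density defined in Section 3.3 (phrases divided by in-text citations), the values can be recomputed from the counts reported in Table 4. The variable and function names below are illustrative only.

```python
# (knowledge phrases, in-text citations) per discipline, copied from Table 4.
TABLE_4 = {
    "Clinical Medicine": (113424, 61385),
    "Social Sciences, General": (46532, 28008),
    "Psychiatry / Psychology": (31765, 19446),
    "Neuroscience & Behavior": (5365, 3014),
    "Multidisciplinary": (4470, 2561),
    "Computer Science": (2750, 1979),
    "Immunology": (2434, 1352),
    "Biology & Biochemistry": (1905, 1301),
    "Pharmacology & Toxicology": (1620, 876),
    "Economics & Business": (1189, 855),
}


def knowledge_density(table):
    """Average number of associated phrases per in-text citation."""
    return {d: round(phrases / cites, 3) for d, (phrases, cites) in table.items()}


density = knowledge_density(TABLE_4)

# Rank the disciplines by phrase frequency and by in-text citation frequency.
by_phrases = sorted(TABLE_4, key=lambda d: TABLE_4[d][0], reverse=True)
by_citations = sorted(TABLE_4, key=lambda d: TABLE_4[d][1], reverse=True)
total_phrases = sum(phrases for phrases, _ in TABLE_4.values())
```

The two rankings coincide for these ten disciplines, and the phrase counts sum to the 211,454 phrases analyzed in the text.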