=Paper=
{{Paper
|id=Vol-1610/paper6
|storemode=property
|title=Exploring the Leading Authors and Journals in Major Topics by Citation Sentences and Topic Modeling
|pdfUrl=https://ceur-ws.org/Vol-1610/paper6.pdf
|volume=Vol-1610
|authors=Ha Jin Kim,Juyoung An,Yoo Kyung Jeong,Min Song
|dblpUrl=https://dblp.org/rec/conf/jcdl/KimAJS16
}}
==Exploring the Leading Authors and Journals in Major Topics by Citation Sentences and Topic Modeling==
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
Exploring the leading authors and journals in major
topics by citation sentences and topic modeling
Ha Jin Kim1, Juyoung An1, Yoo Kyung Jeong1, Min Song1
1Department of Library and Information Science, Yonsei University, 50 Yonsei-ro,
Seodaemun-gu, Seoul, South Korea
{hajin_228, anjy, yk.jeong, min.song}@yonsei.ac.kr
Abstract. Citation plays an important role in understanding the knowledge shar-
ing among scholars. Citation sentences embed useful contents that signify the
influence of cited authors on shared ideas, and express own opinion of citing
authors to others' articles. The purpose of the study is to provide a new lens to
analyze the topical relationship embedded in the citation sentences in an inte-
grated manner. To this end, we extract citation sentences from full-text articles
in the field of Oncology. In addition, we adopt Author-Journal-Topic (AJT)
model to take both authors and journals into consideration of topic analysis. For
the study, we collect the 6,360 full-text articles from PubMed Central and select
the top 15 journals on Oncology. By applying AJT model, we identify what the
major topics are shared among researchers in Oncology and which authors and
journal lead the idea exchange in sub-disciplines of Oncology.
Keywords: text mining; citation analysis; topic modelling; bibliometrics
1 Introduction
As the size of data on the web continues to increase in an exponential manner, find-
ing valuable meaning between data becomes of paramount importance in many re-
search areas. In the information science field, citations are challenging, pivotal materi-
als to discover the relationship between academic documents because citations present
the description of authors' ideas and the hidden relationship between authors and doc-
uments. The earliest works focused mainly on classifying the citation behaviors and
discovering the citation reasons with limited data such as the location of citation sen-
tences and the number of references [1,2].
Since the mid-1990s, with the development of computer technology, citation content
analysis was elaborated by applying data analysis techniques like text-mining or natural
language processing. Zhang et al. [3] present citation analysis based on sematic and
syntactic approaches. Semantic-based citation analysis is performed by qualitative
analysis to discover the citation motivation and citation classification. On the other
hand, syntactic-based citation analysis can be conducted by citation location and cita-
tion frequency, which reveals the hidden relation of authors by using meta-data of doc-
uments such as journal, venue of publication, affiliation of authors, etc. Following their
42
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
study, Ding et al. [4] propose a theoretical methodology through content citation anal-
ysis. However, these analyses are somewhat limited to the explicit context that primar-
ily represents their own ideas and arguments.
The main goal of the paper is to discover the implicit topical relationships buried in
citation sentences by utilizing the citation information from the author’s perspective of
sharing other authors’ point of view. Implicitness of the topical relationship is realized
by using citation sentences as the input for the topic modeling technique. In this study,
a citation sentence indicates the sentence including citation expression consisting of
year and author of the cited work. In general, the citation sentence contains brief content
of cited work and opinion that the author of citing work on the cited work. We claim
that citation sentences reveal interesting characteristics of scholarly communication
such as influence, idea exchange, justification for citer’s arguments, etc. We assume
that using citation sentences for topic analysis reveals aforementioned characteristics.
To explore such intellectual space created by citation sentences, we take both authors
and journals into consideration of topic analysis. To this end, we applied Author-Con-
ference-Topic (ACT) model proposed by Tang et al. [5] for our topic analysis in relation
with both authors and journals, which is called Author-Journal-Topic (AJT) topic
model. ACT model is a probabilistic topic model for simultaneously extracting topics
of papers, authors, and conferences. There are a few studies to analyze content of cita-
tion sentences. Most of previous studies focus on how the topic of document influences
citation and vice versa [6,7,8] using Topic Modeling. Kataria, Mitra, and Bhatia [8]
adapt citation to Author-Topic model [9] with the assumption that the context surround-
ing the citation anchor could be used to get topical information about the cited authors.
These studies including Tang et al. [10]’s ACT model are the examples of combining
topic modelling methods and citation content analysis. However, most previous studies
used metadata of documents. In this work, we focus on identifying the landscape of the
oncology field from a perspective of citation. By using citation sentences, our results
can indicate which authors are actively cited and which journals lead a certain topic.
The rest of the paper is organized as follows: Section 2 describes the proposed ap-
proach. Section 3 analyzes the topic modeling results. Section 4 concludes the paper
with the future work.
2 Methodology
2.1 Main idea
The basic assumption of the proposed approach is that citation sentences embed use-
ful contents signifying the influence of cited authors on shared ideas of citing authors.
Citation sentences are also considered as an invisible intellectual place for idea ex-
changing since citations are effective means of supporting and expressing their own
arguments by using other works. In the similar vein, Di Marco and Mercer [11] claim
that citation sentences play a major role in creating the relationship among relevant
authors within the similar research fields. With these assumptions, we are to explore
the implicitness of topic relationships resided in citation sentences from the integrated
43
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
perspective by incorporating the citing authors and journal titles into interpreting the
topical relationships.
As shown in Figure 1, we utilized various features including citing authors, citing
sentences and journal titles for topic analysis. Authors in Figure 1 mean the citing au-
thors who write a paper and who cite other’s work. Citation sentences are the sentences
written by the authors when they cite other’s work in the paper, and journal titles are
the journal names publishing the citing authors’ paper. By employing AJT model with
these three parameters , we can discover which topics are the most salient ones referred
to frequently by researchers and who are the leading authors sharing other authors’
ideas in the research field and which journal leads such endeavor.
Fig. 1. Three parameters for AJT model
2.2 Data collection
For this study, we compile the dataset on the field of Oncology from PubMed Central
that provides the full-text in the biomedical field. We select top 15 journals of Oncology
by Thomson Reuter’s JCR and journal’s impact factor, and from these 15 journals, we
are able to collect 6,360 full-text articles.
2.3 Method
Fig. 2. Workflow
Figure 2 describes the workflow of our study. As mentioned earlier, with the full-
text articles collected from PubMed Central, we extract the citation sentences. Most
citation sentences are kept in the following format: (author, year), (reference number)
[reference number]. An example of such format is “(Author name, 2000 )”. We use the regular expression technique to
parse and extract the citation sentences, when the tag , appears on
the sentences after parsing XML records with the Java-based SAX parser.
We also parse other metadata for AJT model such as the name of authors and journal
titles. The author tags, and in-
side the , denote the list of authors who wrote the pa-
per. For journal, we extract the titles when the journal tags, and , are included in the tag of and . We also pre-
process extracted sentences by removing both functional and general words and apply-
ing the Porter’s stemming algorithm to improve the input for AJT Model.
2.4 AJT Model
For our study, we apply ACT [10] model with several metadata such as citation sen-
tences, journal titles and citing authors to develop AJT model. Our AJT model utilizes
journal titles and citation sentences instead of conference and abstract on documents.
The change of model is needed to analyze most influential topics in Oncology and to
find leading authors who frequently mention the active topics and to detect the journals
involved in such topics.
Fig. 3. Graphical representation and Notions of AJT model, which applies ACT model
(Applied Tang, J., Jin, R., & Zhang, J., 2008, p.1056, Figure 1, Table 1)
Like ACT model, AJT model assumes that each citing author is related to distribu-
tion over topics and each word in citation sentences is derived from a topic. In the AJT
model, the journal titles are related to each word. To determine a word (ω_Si) in citation
sentences (S), citing authors (x_Si) are consider for a word. Each citing author is asso-
ciated with a distributed topic. A topic is generated from the citing author-topic distri-
bution. The words and journal titles are generated from a specific topic. AJT model
presents (1) the distribution θ of A citing authors-topics, the distribution of ∅ of T topic-
words, and the distribution φ of T topic-journal titles; and (2) the following topic z_Si
and citing author x_Si for each word ω_Si. The detailed descriptions of the algorithm
are provided in the Tang et al.’s paper [5,10].
45
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
3 Results and Analyses
For AJT model, we set the number of topics to 15 and finally select 8 topics as major
topics. Since we discovered that there are similar topics on our results, we calculated
the similarity between 15 topics to select the most representative topics. The topical
similarities are measured by each word on topics and we calculated the similarities of
two topics where each topic are represented in an array of a term vector. Through this
process, we chose 8 topics which have high topical similarities (over 0.5). Each topic
presents top 5 words from topic-word distribution, and 5 most related authors and jour-
nal titles are displayed along with each topic. By performing several times on the pilot
studies, we decided to choose top 5 words which are quite appropriate to describe each
topics.
The results of AJT-based topic modeling is shown in Table 1. We label topic 1
“breast cancer” whose top words include breast, expression women, and growth. Since
the dataset is compiled with citation sentences, it implies that the topic “breast cancer”
is a popular topic where researchers share and exchange ideas and facts related to breast
cancer. In relation to the topic “breast cancer”, the active authors of breast cancer are
Johnston Stephen RD, Colditz Graham A, and Sternlicht Mark D, and they share ideas
with others on breast cancer from our results. In terms of journals that provide a com-
mon place for idea sharing and communication, the journal “Breast Cancer Research”
is the top journal of topic 1, and its impact factor is 5.49. Authors such as Kurzrock
Razelle, and Axelrod Haley in group 4 are the leading researchers sharing ideas on the
topic “targeted therapy.” The topic 4 is associated with the targeted therapy represented
by words like mutations, treatments, therapy and disease. The two most influential jour-
nals in topic 4 are “Oncotarget” and “Journal of Thoracic Oncology” whose impact
factors are 6.36 and 5.28 respectively, which indicates that these two journals are the
major journals encouraging authors to share ideas and collaborate with each other on
cancer targeted therapy subject area. Authors like Zitgel Laurence, Galluzzi Lorenzo,
and Kroemer Guido in the author group 7 are the ones that actively share ideas about
the topic “Cancer Immunology.” Top concepts that are related to this topic are cell,
immune, clinical and antitumor. The top journal of the topic “Cancer Immunology” is
Oncoinmmunology whose impact factor is 6.266. Romagnani Paola and Salem Husein
K in topic 8 “Stem Cell” are the authors that communicate and share ideas actively with
each other in the given field, and the journal “Stem Cells” (impact factor: 6.523) is the
leading journal.
Table 1. The Results of AJT-based Topic Modeling in Oncology
Topic1 Topic2 Topic3 Topic4
Breast cancer Cancer epigenetics Leukemia Targeted therapy
breast methylation expression mutations
expression DNA mutations clinical
mammary expression AML treatment
risk gene treatment survival
46
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
women histone leukemia resistance
Author group1 Author group2 Author group3 Author group4
Johnston Stephen RD Gray Steven G. Tefferi A Muller Patricia AJ
Colditz Graham A Mahlknecht Ulrich Anderson K C Vousden Karen H
Sternlicht Mark D Tollefsbol Trygve O. Ratajczak Janina Zaravinos Apostolos
Reis-Filho Jorge S Lichtenstein Anatoly V Schöffski P Dienstmann Rodrigo
Esteva Francisco J Williams David E Gjertsen B T Shtivelman Emma
Journal group1 Journal group2 Journal group3 Journal group4
Breast Cancer Re- Clinical Epigenetics Leukemia Oncotarget
search
Annals of Oncology Oncoimmunology Pigment Cell & Journal of Thoracic
Melanoma Research Oncology
Cancer Cell JNCI Annals of Oncology Annals of Oncology
Cancer Cell
Clinical Epigenetics Molecular Cancer Cancer Cell Clinical Epigenetics
JNCI Annals of Oncology Breast Cancer Re- Oncoimmunology
search
Topic5 Topic6 Topic7 Topic8
Molecular cancer Oncogene pathway Cancer Immunology Stem cell
expression cell cell stem
p53 activity immune expression
mutant activation expression differentiation
gene protein clinical MSCs
survival apoptosis responses growth
Author group5 Author group6 Author group7 Author group8
Clarke Paul A Melino Gerry Zitvogel Laurence Romagnani Paola
Workman Paul Martelli Alberto M Galluzzi Lorenzo Salem Husein K
Hoelder Swen McCubrey James A Kroemer Guido Thiemermann Chris
Akhavan David Blagosklonny Mi- Eggermont Alexan- Lako Majlinda
khail V der
Cassidy Liam D Steelman Linda S Vacchelli Erika Mellough Carla B
Journal group5 Journal group6 Journal group7 Journal group8
Cancer Cell Oncotarget Oncoimmunology Stem Cells
Neuro-Oncology Annals of Oncology Annals of Oncology Annals of Oncology
Oncotarget Cancer Cell Breast Cancer Re- Cancer Cell
search
Molecular Oncology Clinical Epigenetics Cancer Cell Clinical Epigenetics
Molecular Cancer Oncogene Clinical Epigenetics Molecular Cancer
We visualize topic keywords obtained from results of AJT-based topic model. We
construct the co-occurrence network and analyze which topic words play an important
role in this domain. Each node in the network represents a topic word, and an edge
represents a co-occurrence frequency between keywords. The size of nodes represents
47
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
degree centrality and the color means network clusters obtained by using modularity
algorithm. This network consists of 100 nodes and 1,436 edges. As shown in Figure 4,
each topic belongs to a specific community, but shares some important topic keywords.
Especially, the topic words positioned at the center is represented core-keywords in
Oncology. Figure 4 indicates that these words are the essential concepts of the Oncol-
ogy domain. Along with the results of AJT-based topic models, we can infer the major
journals and authors develop their own research area based on these core-concepts.
Fig. 4. Network of topic keywords
The above results imply that the proposed approach identifies which topics are fre-
quently shared, who facilitates to exchange ideas, and which journals provide a place-
holder for it. Identification of the triple relationship among authors, journals, and topics
sheds new insight on understanding the well-discussed topics driven by the leading
journals and authors that play a mediator role in the development of Oncology.
4 Conclusion
One of the major research problems in bibliometrics is how to map out the intellec-
tual structure of a research field. The proposed approach tackles such research problem
by utilizing citation sentences and AJT model. By using citation sentences as the input
48
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
for AJT model to find latent meaning, AJT model suggests a new way to detect leading
authors and journals in sub-disciplines represented by discovered topics in a certain
field. Achieving this is not feasible by traditional frequency-based citation analysis.
One of the interesting observations is that the top-ranked journals in the discovered
topics derived from AJT model are not ranked top in terms of JCR. For example, the
“Oncotarget” journal is the top-ranked journal in three topics in our analysis, but the
ranking of the journal is 20 according to JCR. Since we only report on preliminary
results of our approach, we undertake in-depth analysis to investigate why this differ-
ence exists. We also conduct various statistical tests on the results. Based on the re-
ported results in this paper, though, we claim that AJT can be used for discovering
latent meaning associated citation sentences and the major players leading the field.
As a follow-up study, we will conduct a comparative study that compares the pro-
posed approach with the general topic modeling technique such as LDA. We also plan
to investigate whether there is a different impact of using citation sentences and general
meta-data such as abstract and title for topic analysis on facilitating idea sharing and
scholarly communication. In addition, we would like to consider the window size of
citation sentences enriching citation context and to discover the authors’ relationships
among the neighboring citation sentences.
5 Reference
1. Garfield, E. (1955). Citation indexes for science: A new dimension in documentation
through association of ideas. Science, 122(3159): 108–111. doi: 10.1126/ sci-
ence.122.3159.108.
2. Moravcsik, M. J., & Murugesan, P. (1975). Some results on the function and quality of ci-
tations. Social studies of science, 5(1), 86-92.
3. Zhang, G., Ding, Y., and Milojević, S. (2013). Citation content analysis (cca): A framework
for syntactic and semantic analysis of citation content. Journal of the American Society for
Information Science and Technology, 64(7): 1490-1503.
4. Ding, Y., Zhang, G., Chambers, T., Song, M., Wang, X., and Zhai, C. (2014). Content‐based
citation analysis: The next generation of citation analysis. Journal of the Association for
Information Science and Technology, 65(9): 1820-1833.
5. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z. (2008). Arnetminer: extraction and
mining of academic social networks. In Proceedings of the 14th ACM SIGKDD interna-
tional conference on Knowledge discovery and data mining (pp. 990-998). ACM.
6. Dietz, L., Bickel, S., & Scheffer, T. (2007). Unsupervised prediction of citation influences.
In Proceedings of the 24th international conference on Machine learning (pp. 233-240).
ACM.
7. Nallapati, R. M., Ahmed, A., Xing, E. P., & Cohen, W. W. (2008). Joint latent topic models
for text and citations. In Proceedings of the 14th ACM SIGKDD international conference
on Knowledge discovery and data mining (pp. 542-550). ACM.
8. Kataria, S., Mitra, P., & Bhatia, S. (2010). Utilizing Context in Generative Bayesian Models
for Linked Corpus. In AAAI (Vol. 10, p. 1)
9. Steyvers, M., Smyth, P., Rosen-Zvi, M., & Griffiths, T. (2004). Probabilistic author-topic
models for information discovery. In Proceedings of the tenth ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 306-315). ACM.
49
BIRNDL 2016 Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries
10. Tang, J., Jin, R., & Zhang, J. (2008). A topic modeling approach and its integration into the
random walk framework for academic search. In Data Mining, 2008. ICDM'08. Eighth IEEE
International Conference on (pp. 1055-1060). IEEE.
11. Di Marco, C., & Mercer, R. E. (2004). Hedging in scientific articles as a means of classifying
citations. In Working Notes of the American Association for Artificial Intelligence (AAAI)
Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications,
50-54.
50