<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Computing Interdisciplinarity of Scholarly Objects using an Author-Citation-Text Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Min-Gwan Seo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seokwoo Jung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kyung-min Kim</string-name>
          <email>kimdarwin@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sung-Hyon Myaeng</string-name>
          <email>myaeng@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Advanced Institute of Science and Technology</institution>
          ,
          <addr-line>291 Daehak-ro (373-1 Guseong-dong), Yuseong-gu, Daejeon 305-701</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>62</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>There has been a growing need to determine if research proposals and results are truly interdisciplinary or to analyze research trends by analyzing research papers, reports, proposals and even researchers. In this paper, we tackle the problem and propose a method for measuring interdisciplinarity of scholarly objects. The newly proposed model takes into account authors, citations, and text content of scholarly objects together by building author networks, citation networks and text models. The three types of information are mixed by building network embeddings and sentence embeddings, which rely on the network topology and context-driven word semantics, respectively, through neural network learning. In addition, we propose a new measure that considers not only evenness of disciplines but also distributions of the magnitudes of disciplines so that saliency of disciplines is well represented.</p>
      </abstract>
      <kwd-group>
        <kwd>Interdisciplinary research</kwd>
        <kwd>Document classification</kwd>
        <kwd>Network embedding</kwd>
        <kwd>Document embedding</kwd>
        <kwd>Scientometrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There has been a growing need to determine if research proposals and reports
are truly interdisciplinary. For example, when government funding agencies and
universities want to promote interdisciplinary research, they need to screen
research proposals and reports for their degree of interdisciplinarity [2, 3, 11, 12].
In an effort to address this issue, some past studies sought to apply new measures
borrowed from ecology and economics [
        <xref ref-type="bibr" rid="ref1">1, 12</xref>
        ], but without modeling documents
for their contents and inter-document relations.
      </p>
      <p>In this paper, we tackle the problem and propose a method for measuring
an interdisciplinarity (ID) score of scholarly objects such as individual
documents, journals and conference proceedings. The newly proposed model takes
into account authors, citations, and text content of scholarly objects together to
determine the disciplines they represent. A distribution of disciplines is computed
by building an author network, a citation network and text models. In addition,
we propose a new measure that considers not only the number of disciplines but
also their magnitudes, so that the saliency of disciplines can be taken into account.</p>
      <p>The overall process of calculating an ID score for a
scholarly object is carried out as follows. The input article, for example, is analyzed
not only for its textual content but also for its citations and authors, so that they
are projected onto the corresponding networks built for the entire collection.
Essentially, three vectors are constructed for the three aspects of the scholarly
object and combined to form its feature vector, which is fed into a classifier that
determines its subject categories. The result is a distribution of subject category
(or discipline) strengths. The next step is to compute an ID score.</p>
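      <p>The pipeline above can be sketched as follows. This is a minimal, runnable outline with stubbed components: embed_author, embed_citation, embed_text, classify and id_score are placeholders standing in for the embedding, classification and measurement steps detailed in Section 3, not the actual implementations.</p>

```python
def embed_author(article):
    # Stub: in the real system this is the author-network embedding (dim 64).
    return [0.1, 0.2]

def embed_citation(article):
    # Stub: in the real system this is the citation-network embedding (dim 64).
    return [0.3, 0.4]

def embed_text(article):
    # Stub: in the real system this is the doc2vec text embedding (dim 300).
    return [0.5, 0.6, 0.7]

def classify(feature_vector):
    # Stub: a pre-trained classifier returning a distribution over
    # disciplines (here a toy 3-discipline distribution).
    return [0.5, 0.3, 0.2]

def id_score(distribution):
    # Stub: the interdisciplinarity measure of Section 3.2.
    return 1.0 - max(distribution)

def score_article(article):
    # 1. Build the three aspect vectors and concatenate them.
    v = embed_author(article) + embed_citation(article) + embed_text(article)
    # 2. Classify the feature vector into a distribution of discipline strengths.
    distribution = classify(v)
    # 3. Reduce the distribution to a single ID score.
    return id_score(distribution)

print(score_article({"title": "toy article"}))
```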
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Detecting Disciplines of Scholarly Objects</title>
        <p>
          The way a scholarly object is viewed and analyzed is key to determining
its interdisciplinarity. However, most of the previous studies simply
used citation counts to get a distribution of discipline strengths. The work in [11,
5] calculated interdisciplinarity for journals in Web of Science (WoS). They
assigned journals to 244 subject categories defined by WoS and then grouped the
subject categories into 6 macro-disciplines. To obtain a distribution of disciplines
for a journal, they used article citation information in journals. In order to
identify the names of journals embedded in the reference text, they applied simple
text processing tools. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] proposed a method focusing on interdisciplinarity of
research teams. To compute a distribution of disciplines for a team, they counted
the disciplines to which the researchers in the team belong. They used all the
categories/disciplines of the PhD groups and 205 department groups. In order to
analyze interdisciplinarity of target journals, the work in [9] used various types of
bibliometric indicators and 225 subject categories defined by ISI in its SciSearch
database.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Interdisciplinarity Measures</title>
        <p>
          Previous studies borrowed the concept of interdisciplinarity from diverse
fields such as ecology and economics [
          <xref ref-type="bibr" rid="ref1">1, 12, 14</xref>
          ], where diversity is defined with
three different attributes:
variety: the number of distinct categories (disciplines);
balance: the evenness of the distribution over the distinct categories;
disparity: the degree to which the categories are different from one another.
        </p>
        <p>Table 1 shows interdisciplinarity measures defined previously. Any two
categories (or disciplines) i and j are represented with their proportions p_i and p_j in
the system, while d_{i,j} and s_{i,j} measure the distance and the similarity between
the two. The category count measure uses only a single attribute, but the others
consider multiple attributes. The Shannon entropy, Simpson index, and Stirling's
diversity measures are the most frequently used in computing interdisciplinarity in
the literature. To the best of our knowledge, however, no studies have compared
the different measures for their relative merits with real-life data.</p>
        <p>
Name                  Attribute                      Form
Category count        Variety                        N
Shannon entropy       Variety / Balance              -Σ_{i=1}^{N} p_i log p_i
Simpson               Variety / Balance              Σ_{i=1}^{N} p_i^2
Total dissimilarity   Disparity                      Σ_{i,j} (1 - s_{i,j})
Stirling's diversity  Variety / Balance / Disparity  Σ_{i,j} d_{i,j} p_i p_j
        </p>
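        <p>The non-trivial measures in Table 1 can be computed directly from a distribution of proportions; the sketch below assumes a toy two-discipline distribution and a given distance matrix d, purely for illustration:</p>

```python
import math

def shannon_entropy(p):
    # -sum_i p_i log p_i (variety + balance); 0 log 0 is treated as 0.
    return -sum(x * math.log(x) for x in p if x > 0)

def simpson(p):
    # sum_i p_i^2: a concentration index (lower means more diverse).
    return sum(x * x for x in p)

def stirling_diversity(p, d):
    # sum_{i,j} d[i][j] * p_i * p_j (variety + balance + disparity).
    n = len(p)
    return sum(d[i][j] * p[i] * p[j] for i in range(n) for j in range(n))

p = [0.5, 0.5]                  # perfectly balanced two disciplines
d = [[0.0, 1.0], [1.0, 0.0]]    # maximally distant disciplines
print(shannon_entropy(p), simpson(p), stirling_diversity(p, d))
```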
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Proposed Methods</title>
      <sec id="sec-3-1">
        <title>Author-Citation-Text Model</title>
        <p>The ACT (Author-Citation-Text) model combines information obtainable
from three sources to compute a distribution of disciplines for a scholarly object:
the author network that captures co-authorships among the articles in the
collection, the citation network that connects articles based on direct citations, and
the text (abstracts in this work) of the articles. In short, an article is represented
jointly with the local text and its relative positions in the global networks (i.e.
author and citation networks).</p>
        <p>Citation information is important in capturing the extent to which the article
cites others under different disciplines. The more articles under different
disciplines are cited, the more interdisciplinary the target article would be. However,
citation information alone may not be sufficient because the citation network
is constructed based on explicit citations. For example, there may not be an
explicit citation for text content under a different discipline if it is too old to
cite or sufficiently well known to the community. In order to compensate for this
limitation, we consider co-authorship relationships and the semantics of text.
The latter is important because different disciplines would have developed their
own vocabularies, which can serve as discriminative features for a classifier.</p>
        <p>In the proposed model, the three different types of information for articles
are used to represent each article as a vector comprised of three corresponding
vectors. An article vector then becomes an input to a research category classifier
which returns a vector whose elements correspond to available research categories
(19 in the current system), which is also referred to as a distribution of disciplines.
We built two classifiers: one based on a neural network with two hidden layers
and the other based on logistic regression.</p>
        <p>Vectors corresponding to author, citation and text information are generated
with embedding methods. Given a citation network connecting all the articles,
where a node and an edge represent an article and a citation relation,
respectively, we generate a vector for each article by employing a network embedding
method that considers node connectivity information in a graph structure [10].
We employ the same method for an author network where a node and an edge
represent an author and a co-authorship relation, respectively. A piece of text
(an abstract in the current implementation) is also represented as a vector by
employing a document embedding method [4], which is an extension of a word
embedding method [8].</p>
        <p>
Algorithm 1 Proposed Interdisciplinarity Measure
1: procedure IDScore(D)
2:   input: distribution of disciplines D
3:   output: interdisciplinarity score based on the salient discipline set
4:   H, L ← partition(D)
5:   return -(1/p_1) Σ_{i∈H} (p_i + L_1) log(p_i + L_1) · |H| · Σ_{i,j∈H} d_{i,j}
        </p>
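        <p>The network embedding method [10] (deepwalk) generates truncated random walks over the graph and feeds them as "sentences" to a skip-gram word-embedding learner. Below is a minimal sketch of the walk-generation half only; the skip-gram training itself would be delegated to an off-the-shelf implementation and is not shown. The toy graph is illustrative, not real data.</p>

```python
import random

def generate_walks(graph, walk_length=5, walks_per_node=2, seed=42):
    # graph: dict mapping each node to a list of its neighbours.
    # Each walk starts at a node and repeatedly moves to a random neighbour;
    # the resulting node sequences are treated as sentences by skip-gram.
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in graph:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph[walk[-1]]
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

citation_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
walks = generate_walks(citation_graph)
print(walks)
```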
        <p>After the document/network embedding step, we obtain a vector v for an
article, which is formed by concatenating the author vector a of dimension 64,
the citation vector c of dimension 64, and the text vector t of dimension 300 for
the article. A text vector of dimension 300 is trained by the doc2vec algorithm.
An author vector is obtained by averaging the vectors of the authors who wrote
the article. Likewise, a citation vector is constructed in the same way. The
network vectors are trained in advance by the deepwalk algorithm [10].</p>
        <p>After obtaining an article vector v, we can now compute the distribution of
disciplines for the article through a classifier. While any classifier would work for
this purpose, we employed a neural network classifier and a logistic regression
classifier. The classifier is pre-trained to predict the distribution of a given article
vector v.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Proposed Interdisciplinarity Measure</title>
        <p>Given a distribution of disciplines for an article or any scholarly object, we
compute an ID score that captures the degree to which it is interdisciplinary.
While Stirling's diversity, introduced in Table 1, has been widely used in
previous studies of interdisciplinarity [6, 12], we propose a new measure that is
not overly biased toward a large number of disciplines involved in a scholarly
object.</p>
        <p>Let us consider an article in bioinformatics, which is clearly an
interdisciplinary area between biology and computer science. Even if the article just
covers the two disciplines, it should not be considered less interdisciplinary than
an article covering more than two areas simply because of the number of areas.
This indicates that the number of disciplines is not a critical factor for
interdisciplinarity as long as at least two salient disciplines are present.</p>
        <p>Based on this observation, the proposed measure does not simply value the
number of disciplines or consider all the expressed disciplines in calculating
interdisciplinarity. Instead, it focuses on salient disciplines with high proportions
in the distribution of disciplines. As a way of identifying salient disciplines, we
introduce a partitioning method that divides disciplines based on their
magnitudes in the distribution. The partition function is described in Algorithm 1,
which shows the overall steps for computing an ID score. It selects salient
disciplines based on one of the following two methods: largest gap and k-means with
initialization.</p>
        <p>The largest gap method sorts the disciplines in a distribution in
descending order of their magnitudes and then calculates the differences between
neighboring disciplines. Disciplines are partitioned into two by drawing a line
between the two disciplines showing the largest difference. The k-means with
init method clusters the disciplines into two groups (i.e. k=2) based on their
magnitudes as their only attribute, so that we can separate the disciplines
with high magnitudes (the salient ones) from those with small magnitudes (the
not-so-salient ones). To avoid the randomness of k-means, we set the two
initial points to the largest and smallest values.</p>
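        <p>The two partitioning strategies can be sketched as follows. This is a simplified, illustrative implementation: the k-means variant works in one dimension with the initial centres fixed to the largest and smallest magnitudes, as described above.</p>

```python
def largest_gap_partition(dist):
    # Sort magnitudes in descending order, find the largest difference
    # between neighbours, and cut the list there.
    p = sorted(dist, reverse=True)
    gaps = [p[i] - p[i + 1] for i in range(len(p) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return p[:cut], p[cut:]            # (salient H, not-so-salient L)

def kmeans_init_partition(dist, iters=20):
    # 1-D k-means with k=2; centres initialised to the extreme values
    # so the result does not depend on random initialisation.
    hi, lo = max(dist), min(dist)
    H, L = [], []
    for _ in range(iters):
        H = [x for x in dist if abs(x - hi) <= abs(x - lo)]
        L = [x for x in dist if abs(x - hi) > abs(x - lo)]
        if H:
            hi = sum(H) / len(H)
        if L:
            lo = sum(L) / len(L)
    return sorted(H, reverse=True), sorted(L, reverse=True)

dist = [0.45, 0.40, 0.05, 0.05, 0.05]
print(largest_gap_partition(dist))
print(kmeans_init_partition(dist))
```

Both methods agree on clearly bimodal distributions, as in the example; they can differ when the magnitudes decay gradually.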
        <p>After the partitioning, the following formula is used to calculate the ID score
from the salient disciplines:</p>
        <p>
-(1/p_1) Σ_{i∈H} (p_i + L_1) log(p_i + L_1) · |H| · Σ_{i,j∈H} d_{i,j}
        </p>
        <p>where H and L correspond to the salient and not-so-salient partitions, respectively.
In the formula, p_i is the proportion of the i-th discipline in the salient partition, and
therefore p_1 is the biggest proportion in the salient partition. Likewise, L_1 is the
biggest proportion in the not-so-salient partition. The distance d_{i,j} between
disciplines i and j is considered for every pair. Various types of distance measures,
such as Euclidean distance and cosine distance, can be used for the distance
term.</p>
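        <p>Under the notation above, the score can be sketched as follows. Note that the leading negation of the entropy-style term is an assumption on our part: the extracted formula does not show the sign explicitly, but negation is consistent with the stated behaviour that more even salient distributions yield higher scores (each log term is negative since p_i + L_1 &lt; 1).</p>

```python
import math

def salient_id_score(H, L, d):
    # H: salient proportions in descending order (H[0] = p_1, the largest);
    # L: not-so-salient proportions in descending order (L[0] = L_1);
    # d: d[i][j], pairwise distances between the salient disciplines.
    # Score = |H| * sum_{i,j in H} d_ij * modified entropy over H,
    # where the entropy term is shifted by L_1 and normalised by p_1.
    L1 = L[0] if L else 0.0
    entropy_term = -sum((p + L1) * math.log(p + L1) for p in H) / H[0]
    distance_term = sum(d[i][j] for i in range(len(H)) for j in range(len(H)))
    return len(H) * distance_term * entropy_term

H = [0.45, 0.40]            # two salient disciplines
L = [0.05, 0.05, 0.05]      # not-so-salient remainder
d = [[0.0, 0.8], [0.8, 0.0]]
print(salient_id_score(H, L, d))
```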
        <p>This formula includes three key factors. First, we consider the size of the
salient discipline set (|H|) for diversity. Second, we use the distance among the
salient disciplines, which is computed as the sum of distances between all
disciplines in the salient set (d_{i,j}). Third, the log term is a modified entropy that
focuses on the degree of integration among the salient disciplines, where L_1 is the
biggest value from the other discipline set. This is used to add the information of the
not-so-salient disciplines. This modified entropy is divided by the biggest value
from the salient discipline set (p_1) for normalization.</p>
        <p>Basically, the log term is based on the entropy within the salient discipline
set; the score increases when the distribution of salient disciplines is even. Because
of the distance term (d_{i,j}), the score also increases when the disciplines in the
salient set are different from one another. Finally, the score increases if multiple
disciplines are considered salient, because of the size term (|H|).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluations</title>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>We used bibliographic data of research articles from Microsoft Academic Search
(MAS), which includes meta-data such as titles, authors, fields of study,
references, keywords, and abstracts. Among this meta-data, fields of study contains
keywords that show the subject fields of the paper, such as 'physics', or techniques
such as 'logistic regression' and 'wireless network'. The values under keywords
are automatically extracted from the title and abstract of a given paper. We
collected data from 1994 to 2015 at three-year intervals, mainly due to limits
on storage and computing time. For more qualitative analysis involving human
judgments, we only used 2015 data, as older articles tend to have more missing
meta-data.</p>
        <p>To construct a gold standard for training and testing, we assign the 19
discipline fields to all the articles. That is, the task of the proposed method, as well as of
the baselines, is to assign discipline labels as close to the gold standard as
possible. Because the articles in MAS do not have discipline field labels, we devised a
method for automatically assigning labels. This discipline field labeling method
compares the values of fields of study in each article against the 19 major discipline
labels and 268 sub-discipline labels pre-defined by MAS. For the comparison, we
used exact string matching between each value under fields of study and each
discipline. Figure 1 shows the overall process for automatic paper labeling.</p>
        <p>Pre-defined major discipline labels are as follows: Art, Biology, Business,
Chemistry, Computer Science, Economics, Engineering, Environmental Science,
Geography, Geology, History, Materials Science, Mathematics, Medicine,
Philosophy, Physics, Political Science, Psychology, Sociology.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Evaluation of Discipline Classification</title>
        <p>Because the keywords part includes automatically extracted words from the
paper, using them for labeling could assign irrelevant labels to the paper.
Therefore, we decided to use the values of fields of study for conservative labeling. The
matching discipline name, or the parent of a matching sub-discipline name, was
assigned to the article. After the labeling step, an article can carry multiple
discipline labels if its fields of study contain multiple matching names.</p>
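        <p>The conservative labeling rule can be sketched as follows. The tiny label inventories here are hypothetical stand-ins for the 19 major disciplines and 268 sub-disciplines pre-defined by MAS, used only to make the matching logic concrete.</p>

```python
# Hypothetical, much-reduced label inventories for illustration only.
MAJOR_DISCIPLINES = {"Biology", "Computer Science", "Physics"}
SUB_TO_MAJOR = {
    "Genomics": "Biology",
    "Machine Learning": "Computer Science",
}

def label_article(fields_of_study):
    # Exact string matching: a value matching a major discipline keeps that
    # label; a value matching a sub-discipline is mapped to its parent.
    # Unmatched values (e.g. technique names) are ignored.
    labels = set()
    for value in fields_of_study:
        if value in MAJOR_DISCIPLINES:
            labels.add(value)
        elif value in SUB_TO_MAJOR:
            labels.add(SUB_TO_MAJOR[value])
    return sorted(labels)   # an article may receive multiple labels

print(label_article(["Genomics", "Machine Learning", "wireless network"]))
```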
        <p>Table 2 shows the number of unique authors and articles. The whole articles
were collected from MAS; the target articles were obtained after eliminating the
articles without citation and/or abstract fields from the whole articles.</p>
        <p>
Year                   1994       1997       2000       2003       2006       2009       2012       2015
# of unique authors    2,345,006  2,459,321  2,561,151  2,809,515  3,026,501  3,218,836  3,453,158  3,608,795
# of target articles   141,367    208,253    359,657    487,069    681,916    800,585    786,753    593,515
# of whole articles    1,697,000  1,748,000  1,769,000  1,921,000  1,818,000  1,698,000  1,663,000  1,648,000
        </p>
        <p>In order to evaluate the proposed method, we used the author network, citation
network and text embeddings as features for logistic regression and neural network
classifiers. The proposed ACT model was compared against its sub-components,
namely citation only (C) and a combination of citation and author information
(AC), which have been used previously. We chose logistic regression and neural
network classifiers because the former is well known for general effectiveness
and the latter for its popularity based on high performance. For the neural
network classifier, we used two hidden layers with 256 and 128 nodes. Following
well-known parameter settings, we employed the Rectified Linear Unit (ReLU)
as the activation function and categorical cross-entropy as the loss function.
AdaDelta was used as an optimizer, and simple SGD with mini-batches of 128
was used for gradient updates.</p>
        <p>For the logistic regression classifier, we used cross-entropy as the loss function
and an L2 penalty for regularization. This classifier was trained under the
one-vs-rest scheme for the multi-label problem. For all the training and testing, we used
10-fold cross validation with all the collected articles. For each cross validation
step, we train classifiers using the author, citation and text information and the labels
from papers in the training set, and predict the distributions of papers in the test
set.</p>
        <p>Because the output of the classifier is a distribution over the disciplines whereas
each of the article labels is treated as a binary value, both label-oriented and
distribution-oriented evaluations were conducted. For the former, we define label
precision (LP) as follows:</p>
        <p>
LP(X, Y) = (1/|Y|) Σ_{i=0}^{|Y|} 1_Y(x_{i,dis})
        </p>
        <p>where X is the set of disciplines sorted in descending order of magnitude
and Y is the set of disciplines whose label is 1. 1_Y(x) is an indicator function that
returns 1 if x ∈ Y and 0 otherwise, and x_{i,dis} is the discipline of the i-th element
in X. Label precision measures the extent to which the true discipline labels are
highly ranked in the predicted discipline distribution. For the distribution-oriented
evaluation, we use Jensen-Shannon divergence (JSD) [7], which measures
dissimilarity between two distributions. Hence, the lower the JSD, the better the
predicted discipline distribution.</p>
        <p>
Classifier Type          Logistic Regression       Neural Network
Feature Combinations     C       AC      ACT       C       AC      ACT
Average Label Precision  0.8103  0.8097  0.8478    0.8332  0.8250  0.8497
Average JS-divergence    0.2852  0.2853  0.2525    0.1532  0.1560  0.1313
        </p>
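        <p>The two evaluation metrics can be sketched as follows: label precision as defined above (checking the top |Y| ranked predictions against the gold set), and Jensen-Shannon divergence computed with the natural logarithm. The toy inputs are illustrative only.</p>

```python
import math

def label_precision(ranked_disciplines, gold_labels):
    # ranked_disciplines: disciplines sorted by predicted magnitude
    # (descending); gold_labels: set of true discipline labels.
    # Counts how many of the top |Y| predictions are true labels.
    top = ranked_disciplines[:len(gold_labels)]
    return sum(1 for x in top if x in gold_labels) / len(gold_labels)

def js_divergence(p, q):
    # Jensen-Shannon divergence: the mean KL divergence of p and q
    # from their midpoint distribution m. Lower means more similar.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(label_precision(["cs", "bio", "math"], {"cs", "math"}))
print(js_divergence([1.0, 0.0], [0.5, 0.5]))
```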
        <p>Table 3 shows the average LP and JSD results over all the evaluated articles.
ACT shows the best performance with both classifiers, confirming the
superiority of the proposed method compared to the baseline of using citations only.
It should be noted that using citation information alone through the network
embedding, i.e. making use of the disciplines of the cited articles, already gives
relatively high performance (0.8332 in LP and 0.1532 in JSD with the neural
network classifier).</p>
        <p>It is also worth noting, however, that when the author information is added
to the citation information (AC), the performance decreases slightly in both LP
and JSD. Our analysis indicates that the author embeddings were not as effective
as the citation embeddings because the author network, built on co-authorship
relations, is much sparser than the citation network. With a small network and
low weights on edges caused by small co-authorship frequencies, the training
of the author embeddings had an adverse effect on the final classification.
Left for future research is a comparison against the direct use of author
affiliations such as departments, although that may introduce a bias, since the
inclusion of an author from a different disciplinary department does not
necessarily make the research interdisciplinary.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Evaluation under Different Interdisciplinarity Measures</title>
        <p>This part of the evaluation has dual goals: one is to validate the proposed measure
against the previously used ones in determining interdisciplinarity, and the other
is to evaluate the proposed model under the different measures of
interdisciplinarity. For a more qualitative analysis, we constructed a ground truth based on
human judgments of journals and conferences. We first randomly selected 100
journals/conferences published in 2015 from a total of 1156 journals/conferences
that contain more than 100 papers, to ensure that the selected
journals/conferences have enough data for the proposed ACT model.</p>
        <p>Six human raters were asked to evaluate the interdisciplinarity of each
journal/conference with scores ranging from 1 to 5, based on its introduction, aims,
and research interest pages. The raters were given a guideline that specifies, for
each rating, the number of disciplines covered, the evenness of the included
disciplines, and their differences. The journals/conferences that did not obtain
four or more votes for a particular score were excluded to ensure the credibility of
the test collection. As a result, 75 journals/conferences remained with their
scores for interdisciplinarity.</p>
        <p>This ground truth data was used to evaluate the different interdisciplinarity
measures, including the proposed one, under which the proposed model is also
evaluated from different perspectives. We adopted Spearman's rank correlation
coefficient [13] between the human judgments and the computed scores
to compare the interdisciplinarity measures. For the interdisciplinarity
value of a journal/conference computed with a particular measure, we took the
mean of the values computed for all the articles in it.</p>
        <p>In order to compute the distance d_{i,j} between two disciplines for the Stirling's
measure and the Salient measures, we created a discipline-discipline citation matrix
X, where X_{i,j} is the citation count from discipline i to discipline j, and calculated
d_{i,j} = 1 - cos(X_i, X_j), where X_i is the citation vector from the i-th discipline to all
other disciplines and cos(X_i, X_j) is the cosine similarity.</p>
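        <p>The distance construction can be sketched as follows, with a small hypothetical citation matrix standing in for the real 19x19 discipline-discipline counts:</p>

```python
import math

def discipline_distance(X, i, j):
    # X[i][j]: citation count from discipline i to discipline j.
    # d_{i,j} = 1 - cos(X_i, X_j), where X_i is row i (the outgoing
    # citation profile of discipline i).
    dot = sum(a * b for a, b in zip(X[i], X[j]))
    norm_i = math.sqrt(sum(a * a for a in X[i]))
    norm_j = math.sqrt(sum(b * b for b in X[j]))
    return 1.0 - dot / (norm_i * norm_j)

X = [[10, 2, 0],   # toy citation counts between three disciplines
     [2, 8, 4],
     [0, 4, 6]]
print(discipline_distance(X, 0, 1))
print(discipline_distance(X, 0, 0))   # identical profiles: distance ~ 0
```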
        <p>Table 4 shows the Spearman correlation between the ground truth and the
rankings under the interdisciplinarity measures, for the cases of using C and ACT
vectors and the two classifiers. Salient(lg) and Salient(km) denote the proposed
interdisciplinarity measure with the largest gap and k-means with initialization
methods, respectively. We show the results for only two different feature vector
types because AC was shown to be problematic in the previous experiment.</p>
        <p>Most notable in the results is that, regardless of the classifiers or the models,
the proposed measure is most similar to the ground truth, i.e. the way human
judges evaluate the interdisciplinarity of the journals/conferences. While the
guideline was geared toward the evaluation aspects included in the newly
proposed measure, this result confirms that the new measure follows more closely
the human raters' qualitative judgments. It is also worth noting that Stirling's
measure is consistently superior to entropy, perhaps because of its consideration
of the distance between disciplines.</p>
        <p>The results also confirm the superiority of the proposed model under the
different measures, including the new one. The highest correlation, 0.5732, was
obtained with the ACT features and the neural network classifier. Between the
two different ways of selecting salient disciplines, the one with k-means clustering
was consistently superior.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>This paper addresses the problem of computing the degree of interdisciplinarity
of a scholarly object. We identified two problems: one is the lack of a proper
model for determining the disciplines represented by a scholarly object, and the
other is the way interdisciplinarity is measured. For the first one, we propose the
Author-Citation-Text joint model that predicts the distribution of disciplines
in a scholarly object based on the learned citation, author and document
embeddings. For the second problem, we propose a new measure that
takes into account the saliency of the disciplines appearing in a scholarly object.</p>
      <p>From the experiment with a collection of articles over multiple years, the
proposed model shows that the combination of the three aspects of articles can
predict the discipline distributions more accurately. We also conducted a
separate experiment for a more qualitative analysis by constructing a gold standard
of 75 journals/conferences based on human judgments. Comparing the
Spearman correlation between human judgments and the automatically computed
interdisciplinarity shows that the proposed measure captures the intended
aspects of interdisciplinarity and that the proposed model is also superior under
different measures.</p>
      <p>The current work is novel in its tackling of the two key issues: modeling
scholarly objects for determining discipline distributions and measuring
interdisciplinarity. In addition, the construction of the two test collections for
evaluation is also a significant contribution. However, this work has
several limitations. First, we did not fully explore different ways of using author
information, other than building an author network and author embeddings.
Likewise, there is plenty of room for considering different ways of combining
citation and text information, and even for constructing different representations,
as the techniques for document and network embeddings are still in progress.
Second, there is room for improving the quality of the collections we developed for
evaluation. While the way the discipline labels were attached to the individual
articles is reasonable and quite accurate, it should be examined more carefully
for accuracy, perhaps by employing human experts or by means of crowdsourcing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aydinoglu</surname>
            ,
            <given-names>A.U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Measuring diversity in disciplinary collaboration in research teams: An ecological perspective</article-title>
          .
          <source>Research Evaluation</source>
          <volume>25</volume>
          (
          <issue>1</issue>
          ),
          <fpage>18</fpage>
          -
          <lpage>36</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>