<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Citation Intent Classification Through Weakly Supervised Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinwei Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kian Ahrabian</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arun Baalaaji Sankar Ananthan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Delwin Myloth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jay Pujara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Sciences Institute</institution>
          ,
          <addr-line>Marina del Rey, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Southern California</institution>
          ,
          <addr-line>Los Angeles, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>Citations are scientists' tools for grounding their innovations and findings in the existing collective knowledge. They are used for semantically distinct purposes, as scientists utilize them at different parts of their work to convey specific information. As a result, a crucial aspect of scientific document understanding is recognizing the authorial intent associated with citations. Current state-of-the-art methods rely on contextual sentences surrounding each citation to classify the intent. However, in the absence of textual content, these approaches become unusable. In this work, we propose a text-free citation intent classification method built on relational information among scholarly works. To this end, we introduce a large-scale knowledge graph built from the publications in the SciCite dataset and their multi-hop neighborhood extracted from The Semantic Scholar Open Research Corpus (S2ORC). We also augment this knowledge graph by adding weakly labeled links based on the intent information available in the S2ORC. Finally, we cast the intent classification task as a link prediction problem on the newly created knowledge graph. We study this problem in both transductive and inductive settings. Our experimental results show that we can achieve a comparable macro F1 score to word embedding content-based methods by only relying on features and relations derived from this knowledge graph. Specifically, we achieve macro F1 scores of 62.16 and 59.81 in the transductive and inductive settings, respectively, on the link-level SciCite dataset. Moreover, by combining our method with the state-of-the-art NLP-based model, we achieve improvements across all metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Citation Intent Classification</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Graph Neural Networks</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Weakly supervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Citations are the primary way of identifying past contributions and connecting progress in new publications to existing literature. Nevertheless, not all citations indicate the same meaning. Authors use citations sparingly with specific intent behind them. For example, some papers are cited for providing background information in a domain, while others are cited when adopting or adapting a previously-used methodology. There are also scenarios where the same paper is used as background information and methodology use-case in different contexts simultaneously. Understanding citation intent is crucial to studying scholarly works, given the universality of using citations. Current state-of-the-art citation intent classification models [
        <xref ref-type="bibr" rid="ref1">17, 1, 4</xref>
        ] rely heavily on textual information, e.g., the sentences surrounding the citation. However, such information is expensive to obtain and in some scenarios inaccessible altogether. Consequently, we need models that could operate without having access to textual information. Previous works [
        <xref ref-type="bibr" rid="ref7">3, 26, 6</xref>
        ] have shown the importance of relational and structural information available in links among publications for various tasks. In this work, we propose a general citation intent classification method that relies purely on structural information.
      </p>
      <p>
        Besides helping researchers better understand the relationship among publications, citation intent analysis has been used for studying various other aspects of scientific works such as research domain evolution [10], scientific impact analysis [19], scientific document summarization [5], and retrieving related scientific works [16]. The three main categories of citations are “Result,” “Method,” and “Background” [4]. These categories describe the reasons behind making a scientific connection, referencing a publication in another publication. Classifying citations into these groups has traditionally required a high level of expertise in the respective scientific domains. This constraint, combined with the high cost of expert human labor, has resulted in highly scarce datasets, which makes the task even more difficult.
      </p>
      <p>
        The Third AAAI Workshop on Scientific Document Understanding 2023, February 14th, 2023, Washington, DC, USA. * Corresponding author: xinweidu@usc.edu (X. Du); ahrabian@usc.edu (K. Ahrabian). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).
      </p>
      <p>
        Previous works have proposed classifying citation intent through feature engineering-based [10] and representation learning-based [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] methods. However, most of these methods depend on textual information. As a result, they require a complex multi-stage pipeline of parsing documents, identifying citation contexts, and predicting citation intent [13]. Besides being prone to error propagation from various pipeline stages, the use of these models is limited to situations where the full text is available in a proper format. This work introduces a pure graph-based approach to classifying citation intent.
      </p>
      <p>We extend the existing SciCite dataset with 2-hop neighborhoods extracted from The Semantic Scholar Open Research Corpus (S2ORC). To further enrich the graph, we utilize the intent information provided in the S2ORC to create a weakly supervised knowledge graph (KG) consisting of the publications and the relations that match the provided intents. Our main idea is to use contextualized relational patterns to make predictions, obviating the need for textual context. Given the newly built KG, we cast the intent classification problem into the common link prediction problem on KGs. Specifically, we train a model to learn representations for entities and relations. Using these representations, we run the following query on the KG: (u, ?, v), where u cites v.</p>
      <p>Figure 1: Overview of the extracted multi-hop KG. The set of 0-hop nodes includes all the orange nodes. Similarly, the set of 1-hop nodes includes all the orange and blue nodes. The graph could be expanded to include n-hop nodes. The annotated set on each edge represents that specific link’s intent. Specifically, the empty set denotes that the citation link has no intent label.</p>
      <p>Converting this problem into a link prediction task allows us to adapt and extend widely used KG embedding models to this problem. We study the link prediction problem in both transductive and inductive settings. Our experimental results show that although our KG-based method underperforms compared to the large language model-based approaches, it is comparable or even superior to the word embedding-based methods. Moreover, our experiments with combining the NLP-based and graph-based methods show slight improvements over the current state-of-the-art model. These findings further signify the importance of relational patterns for citation intent classification.</p>
      <p>Earlier works went as far as defining 35 [7] and 12 [21] fine-grained schemes for scientific arguments. More recent works, however, have focused on creating more concise categories. For example, ACL-ARC [10] proposes a 6-class intent categorization scheme: Background, Motivation, Uses, Extension, Comparison or Contrast, and Future. SciCite [4] is even more restrictive and drops or combines small fine-grained classes to provide a more concise 3-class annotation scheme: Background, Method, and Result.</p>
      <p>The contributions of this work are as follows:</p>
      <sec id="sec-1-1">
        <title>2.2. Citation Intent Classification Methods</title>
        <p>1. Extending the SciCite dataset using the S2ORC dataset to generate a large-scale weakly supervised KG.
2. Introducing a novel graph-based approach for citation intent classification built on top of the newly built KG.
3. Presenting benchmarks for both transductive and inductive settings.
4. Presenting analyses on the effect of different parts of the methodology, such as weak supervision and feature engineering.</p>
        <p>
          Before the explosion of deep learning approaches, most
methods relied on a combination of hand-crafted features
and classic machine learning models. For example, in
one instance [
          <xref ref-type="bibr" rid="ref3">23</xref>
          ], the authors propose 12 different features,
including citation count, PageRank value, and author
overlap, and use classic machine learning models such
as SVM and Random Forest for classification. In another
instance [10], authors define pattern-based, topic-based,
and prototypical argument features and use SVM to make
predictions.
        </p>
        <p>
          2. Related Work. 2.1. Citation Function/Intent Schemes. Many prior works have studied the problem of creating categorical schemes for citation intent, which in some works is referred to as citation function [9]. Earlier works focused on creating more fine-grained categories.
        </p>
        <p>
          With the advent of deep learning models and the emergence of large language models in recent years, representation learning-based methods have outperformed the hand-crafted methods, achieving higher accuracy by considering the textual information. Recent works have proposed the use of structural scaffolds [4], BERT-based models trained on the scientific corpus (SciBERT) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], word embedding-based approaches [17], and creating a heterogeneous context graph based on an academic corpus.
        </p>
        <sec id="sec-1-1-1">
          <p>
            KGs are structured information repositories consisting of a set of nodes representing entities and a set of typed edges representing relations. Since, in most cases, the KG nodes and edges are not attributed, KG embedding (KGE) models aim to learn low-dimensional representations for all entities and relations. The most common traditional shallow KGE methods are TransE [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], ComplEx [22], and RotatE [20]. More recent GNN-based KGE methods leverage the message-passing scheme of GNNs, enabling more complex multi-hop reasoning. Examples of these methods are GCN [11], which leverages the spectral information for information propagation but is limited to mono-relational KGs, R-GCN [18], which extends GCN to support multi-relational KGs, and GraphSAGE [8], which introduces an inductive framework to handle unseen nodes.
          </p>
          <p>
            3.2. Dataset Splitting. The original SciCite dataset contains 11,020 human-labeled samples. Hence, to adapt it to our link prediction setting, we reconstruct two datasets: SciCiteorigin and SciCiteresplit. SciCiteorigin adheres to the same benchmarks reported in prior works but is modified to remove overlapping citation links in the training and test sets. To maximize usage of the training data while removing artifacts, we create SciCiteresplit, which performs additional cleaning, provides a stronger separation of training and test sets, and avoids multi-intent citations. Table 1 showcases the statistics of these datasets.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Dataset</title>
      <sec id="sec-2-1">
        <title>3.1. Entity Mapping</title>
        <sec id="sec-2-1-1">
          <p>We first map each paper in the SciCite dataset to the S2ORC by matching SciCite’s IDs to Semantic Scholar’s SHA IDs. Since a publication could have many SHA IDs and only one Corpus ID, we then map each SHA ID to the unique Corpus ID to extract unique entities.</p>
          <p>From the 13,080 papers with unique IDs in SciCite, we
successfully map 13,019 of them to valid SHA IDs in
semantic scholar, while the remaining 61 papers do not
have any corresponding records. We believe this is due to
publication removals, as the SciCite dataset was created
from the S2ORC in 2019. After converting SHA IDs to
Corpus IDs, we end up with 13,011 unique entities and 8
duplicate entities.
The SciCite dataset focuses on individual citation links and ignores the significance of broader relational connections and features. To overcome this issue, we construct a knowledge graph by mapping each entity in the SciCite dataset to the S2ORC and adding their 2-hop citation neighborhoods. The S2ORC contains more than 206 million publications and 2.49 billion citation links. Apart from the regular citation links, this corpus provides partial intent labels for citations using a 3-class scheme as follows:
1. Background: describes a problem, topic, or concept;
2. Method: provides a method, tool, or dataset;
3. Result: makes a comparison.</p>
          <p>SciCiteorigin: To make methods comparable, we use the same validation and test sets as SciCite for this dataset and try to keep the training set as close as possible. We convert each publication in the SciCite dataset to a Semantic Scholar entity using the mapped Corpus IDs and drop the contextual sentence-level information. We assign a random unique ID to publications without a Corpus ID. After this procedure, we end up with a set of links for our link prediction task.</p>
          <p>Due to the removal of the contextual information, some of the training links appear exactly the same in the test set. Hence, we remove 641 training set samples that also appear in the test set to prevent data leakage. Moreover, since only one link in the test set has multiple intents, we treat the link prediction problem as a multi-class task rather than a multi-label task. In this scenario, the multi-intent links are represented as separate samples with the same inputs and different outputs.</p>
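          <p>The leakage-removal step above can be sketched as follows. This is a minimal illustration with toy links, not the authors' actual preprocessing code; the tuple layout (citing ID, cited ID, intent) is an assumption.</p>

```python
# Sketch of the train/test leakage cleanup: any (citing, cited) pair that also
# appears in the test set is dropped from training. Toy data, assumed layout.

def remove_leaked_links(train_links, test_links):
    """Drop training links whose (citing, cited) pair occurs in the test set."""
    test_pairs = {(u, v) for u, v, _intent in test_links}
    return [link for link in train_links if (link[0], link[1]) not in test_pairs]

train = [(1, 2, "Background"), (3, 4, "Method"), (5, 6, "Result")]
test = [(3, 4, "Background")]
clean = remove_leaked_links(train, test)  # the (3, 4) link is removed
```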
        </sec>
        <sec id="sec-2-1-2">
          <p>Multi-label methods may be a promising future extension of our work.</p>
        </sec>
        <sec id="sec-2-1-3">
          <p>Given the S2ORC dataset, we expand the SciCite dataset using the mapped entities to construct a KG containing 2-hop neighborhoods of the publications. Figure 1 illustrates an overview of the expanded KG. This work uses the 2022-09-13 version of the corpus downloaded from the bulk API. Formally, given the set of mapped entities V_0, the set of k-hop nodes V_k is defined as
V_k = V_{k-1} ∪ { v | ∃ u ∈ V_{k-1} : v ∈ N(u) } (1)
where for a given entity u, N(u) denotes all the entities that cite or are cited by u, i.e., the set of neighboring entities. Given the sets of unlabeled links U and weakly labeled links L, the sets of k-hop edges are defined as
E_k^U = { (u, v, UNK) | u, v ∈ V_k, (u, v) ∈ U } (2)
E_k^L = ∪_r { (u, v, r) | u, v ∈ V_k, (u, v) ∈ L_r } (3)</p>
          <p>SciCiteresplit: Even though we convert the SciCite dataset to the SciCiteorigin, problems such as duplicate citations and multi-label links still exist. Therefore, we further tailor the SciCite dataset to create a better link prediction dataset for graph-based models. First, we remove all the entities, and their related samples, that do not have a mapped Corpus ID. Then, similar to SciCiteorigin, we convert the remaining samples to a set of links. Following this, we drop all duplicate samples. Among the remaining 6,458 unique links, 5,886 only have one intent, 489 have two intents, and 83 have all three intents. We remove all the multi-intent links and resplit the dataset with ratios of 70%/15%/15% for training, validation, and test sets, respectively.</p>
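          <p>The k-hop expansion of Equation 1 amounts to a bounded breadth-first growth of the node set. The following is a minimal sketch on a toy citation graph; the neighbor map and node IDs are illustrative, not data from S2ORC.</p>

```python
# Sketch of the k-hop node expansion in Equation 1:
# V_k = V_{k-1} ∪ { v | ∃ u ∈ V_{k-1} : v ∈ N(u) },
# where N(u) holds the papers citing or cited by u. Toy graph below.

def expand_k_hop(seed_nodes, neighbors, k):
    """Return the set of nodes reachable within k citation hops of the seeds."""
    nodes = set(seed_nodes)  # V_0
    for _ in range(k):
        frontier = set()
        for u in nodes:
            frontier.update(neighbors.get(u, ()))  # cites or is cited by u
        nodes |= frontier  # V_i = V_{i-1} ∪ neighbors(V_{i-1})
    return nodes

# Toy citation neighborhood: a 1-2-3-4 chain.
nbrs = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
v1 = expand_k_hop({1}, nbrs, 1)  # {1, 2}
v2 = expand_k_hop({1}, nbrs, 2)  # {1, 2, 3}
```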
        </sec>
        <sec id="sec-2-1-4">
          <p>The specific statistics of the extracted KG and the original Semantic Scholar corpus are reported in Table 2. Since not every link has weakly labeled intent, this table also provides the percentage of weakly labeled links for each corresponding graph. Although we extract G_2, given its scale, we opt to run our current experiments only on G_1 and leave the larger-scale experiments for future works.</p>
          <p>4.1. Weak Supervision. In order to enrich our data and provide more information to the models, we extract the set of intents provided in the S2ORC dataset for each citation link. The intent labels in S2ORC are extracted using the structural scaffolds model [4] at a sentence level. In this scenario, we implicitly use the existing data derived from the content for bootstrapping our approach. We refer to these links as weakly labeled due to being labeled by a noisy model rather than a human expert. Since the intent labels are partial at a sentence level, citation links could have zero intents in the absence of text or several intents in an abundance of use cases.</p>
          <p>4.3. Feature Engineering. Since none of the publications in our KGs have any features or pre-defined representation, we propose to represent them through their references, citations, and graph-based features. More specifically, from S2ORC we extract the in-degrees and out-degrees of citations (or references), background links, method links, and result links. As a result, each paper is represented with an 8-dimensional feature vector, 4 for each in-degree and out-degree feature.</p>
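          <p>The 8-dimensional feature vector described above can be sketched as follows. The edge-list layout (citing, cited, intent-or-None) and the ordering of the eight entries are assumptions for illustration; the original paper only specifies four in-degree and four out-degree counts.</p>

```python
# Sketch of the 8-dimensional publication features: in-degree and out-degree
# counts for total citations plus Background/Method/Result links (toy edges).

from collections import Counter

INTENTS = ["Background", "Method", "Result"]

def paper_features(paper, edges):
    """edges: (citing, cited, intent-or-None). Returns the 8-dim count vector."""
    cites_in = Counter()   # links pointing at `paper` (citations)
    cites_out = Counter()  # links leaving `paper` (references)
    for u, v, intent in edges:
        if v == paper:
            cites_in["total"] += 1
            if intent:
                cites_in[intent] += 1
        if u == paper:
            cites_out["total"] += 1
            if intent:
                cites_out[intent] += 1
    return ([cites_in["total"]] + [cites_in[i] for i in INTENTS]
            + [cites_out["total"]] + [cites_out[i] for i in INTENTS])

edges = [(1, 9, "Method"), (2, 9, None), (9, 3, "Result")]
vec = paper_features(9, edges)  # in: 2 total, Method 1; out: 1 total, Result 1
```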
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Method</title>
      <p>Throughout the rest of this work, for simplicity, we use the term publication to denote all types of academic publications, e.g., books and papers. Moreover, we use the terms citation and reference to denote incoming and outgoing links, respectively.</p>
      <p>The full edge set combines the unlabeled and weakly labeled links:
E_k = E_k^U ∪ (∪_r E_k^r) (4)
where r ∈ {Background, Method, Result} and L_r denotes the set of all weakly labeled links with label r. Consequently, given the sets of k-hop nodes V_k and edges E_k, the extracted k-hop KG G_k is defined as
G_k = (V_k, E_k) (5)</p>
      <p>Table 3: Intent classification results on the SciCiteorigin and SciCiteresplit datasets. All the metrics are macro averaged. Bold values represent the highest performance within the metric and dataset scope.</p>
      <sec id="sec-3-1">
        <p>
          Moreover, we normalize the non-zero in-degree intent-based features into a [0, 1] probability distribution as follows:
h̄_r = h_r / (h_Background + h_Method + h_Result) (6)
The same normalization step is used for the out-degree features separately (7). Zero-valued features are assigned a normalized value of −1.
        </p>
        <p>
          To obtain scores, we concatenate the vectors of two entities and pass that through an MLP to get logit values. Formally, given a link (u, v) and their respective learned representations (e_u, e_v), we calculate the logit values as
y = MLP([e_u ‖ e_v]) (8)
where y ∈ R^|C| contains the unnormalized logits for each class. The predicted class c is then calculated as
c = argmax sigmoid(y) (9)
        </p>
        <p>
          In the multi-hop model, a node is represented as a combination of the neighboring nodes’ representations. Let h_v^(0) be the extracted feature vector for an arbitrary node v. We calculate the representation of node v at layer l+1 of a multilayer model as
h_N(v)^(l+1) = (1 / |N(v)|) Σ_{u ∈ N(v)} h_u^(l) (10)
h_v^(l+1) = σ( W^(l+1) [h_v^(l) ‖ h_N(v)^(l+1)] ) (11)
        </p>
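        <p>The intent-feature normalization in Equation 6 can be sketched as below. The −1 sentinel for all-zero feature vectors follows the zero-handling convention described above, but its exact form in the original implementation is an assumption here.</p>

```python
# Sketch of Equation 6: non-zero in-degree intent counts are scaled into a
# probability distribution; all-zero features map to a sentinel (assumed -1).

def normalize_intent_counts(background, method, result):
    """Return [h_Background, h_Method, h_Result] normalized to sum to 1."""
    total = background + method + result
    if total == 0:
        return [-1.0, -1.0, -1.0]  # assumed sentinel for all-zero features
    return [background / total, method / total, result / total]

probs = normalize_intent_counts(2, 1, 1)   # [0.5, 0.25, 0.25]
empty = normalize_intent_counts(0, 0, 0)   # [-1.0, -1.0, -1.0]
```

<p>The same function would be applied separately to the out-degree counts, per Equation 7.</p>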
      </sec>
      <sec id="sec-3-2">
        <p>
          Natural Language Processing Models: We include the reported results of several state-of-the-art Natural Language Processing (NLP) methods. Specifically, we include results from word embedding-based methods such as Infersent-KMeans, Infersent-HDBSCAN, Glove-KMeans, and Glove-HDBSCAN [17], the BiLSTM-based method Structural Scaffolds [4], and the large language model-based method SciBERT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Moreover, we report the results of fine-tuning a pre-trained SciBERT model on both datasets. All these methods use textual information and are evaluated on the SciCite dataset.
        </p>
      </sec>
      <sec id="sec-3-3">
        <p>4.5. Multi-Hop Link Prediction (MHLP). Transductive and inductive settings are the most common link prediction evaluation schemes for KGs. The main difference between these two settings is having a fixed set of nodes in both the training and evaluation phases (transductive) versus allowing the addition of unseen nodes in the evaluation phase (inductive). This work refers to citation intent prediction on unseen publications as the inductive setting, whereas the transductive setting refers to citation intent prediction on already-seen publications. We propose an adaptable graph-based model for citation intent prediction in both the transductive and inductive settings. The primary basis of this approach is that a node, i.e., publication, could be represented as a combination of the representations of its neighboring nodes.</p>
        <p>In Equation 11, σ is a non-linear function; throughout our experiments, we specifically use ReLU to introduce nonlinearity. Given the node representations from a K-layer model and a link (u, v), we calculate the logit values as
y = MLP([h_u^(K) ‖ h_v^(K)]) (12)
where y ∈ R^|C| contains the unnormalized logits for each class and C is the set of all classes. The predicted class c is then calculated as
c = argmax sigmoid(y) (13)</p>
        <p>The main disadvantage of the inductive setting is that the unseen nodes only have one available feature, i.e., the reference count. This absence of information makes the task extremely difficult, as the feature vectors are highly sparse. However, our model tries to diminish this effect by using the message-passing scheme, as defined in Equation 11, to aggregate information through connected entities, i.e., cited papers, creating a denser representation for the unseen nodes.</p>
        <p>All our models are trained using the cross-entropy loss defined as
L = −log( exp(y_c) / Σ_{i=1}^{|C|} exp(y_i) ) (14)
where y_c is the logit value for the target class c given the prediction vector y.</p>
        <p>Figure 3: (a) The number of different citation intents. (b) The percentage of different citation intents.</p>
        <p>To further test the capabilities of our proposed model and use both structural and textual information, we devise a multi-modal model comprising encoders for both the graph structure and the citation context. Specifically, we use a pre-trained SciBERT model for encoding the citation phrase text and our MHLP model for encoding the citation graph around the citation link. Figure 2 illustrates an overview of the composite model.</p>
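        <p>One MHLP step (Equations 10 through 13) can be sketched in numpy as follows. The weight matrices are random stand-ins, the toy graph and dimensions are assumptions, and training (Equation 14) is omitted; this is an illustration of the forward computation, not the authors' implementation.</p>

```python
# Minimal numpy sketch of one MHLP layer and link scoring:
# Eq. 10: mean-aggregate neighbor features; Eq. 11: ReLU of a linear map over
# [own ‖ aggregated]; Eq. 12: MLP over concatenated endpoint representations.

import numpy as np

rng = np.random.default_rng(0)

def mhlp_layer(h, neighbors, W):
    """h: (n, d) node features; returns (n, d_out) per Equations 10-11."""
    agg = np.stack([h[list(nb)].mean(axis=0) for nb in neighbors])  # Eq. 10
    return np.maximum(0.0, np.concatenate([h, agg], axis=1) @ W)    # Eq. 11

def link_logits(h, u, v, W_mlp):
    """Eq. 12: logits for link (u, v); here a single linear layer stands in."""
    return np.concatenate([h[u], h[v]]) @ W_mlp

h0 = rng.normal(size=(4, 8))            # 8-dim engineered features (toy)
neighbors = [{1}, {0, 2}, {1, 3}, {2}]  # toy citation neighborhoods
W1 = rng.normal(size=(16, 8))           # layer weight: 2d -> d
W_mlp = rng.normal(size=(16, 3))        # 3 intent classes
h1 = mhlp_layer(h0, neighbors, W1)
logits = link_logits(h1, 0, 1, W_mlp)
pred = int(np.argmax(logits))           # Eq. 13 (argmax over class scores)
```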
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments</title>
      <p>In this section, we report our experimental results on both the SciCiteorigin and SciCiteresplit datasets. All the graph-based experiments are carried out on the G_1 KG. For the traditional KGE methods, we tune their hyperparameters as described in Appendix A.1 and train them using the hyperparameters showcased in Table 4. For the hybrid methods, the KGE component is first trained to generate node features using the hyperparameters described in the appendix. Compared to the reported results [17], our model achieves superior performance to Glove-based models while slightly lagging behind Infersent-based models. Looking into the precision and recall comparison, our method has better precision scores in both transductive and inductive settings compared to all word embedding-based models; however, for recall, it performs better than Glove-based models and worse than the Infersent-based models, which might stem from the imbalance in the links as illustrated by Figure 3a. Further experimentation to address the class imbalance problem in future works might help improve the overall performance of MHLP. The significance of these results is that we show structural and relational information could be used to achieve relatively high performance without using textual information. Moreover, although our models underperform compared to language model-based approaches such as Structural Scaffolds and SciBERT, we showcase interesting future directions for combining graph-based and NLP-based methods.</p>
      <p>Finally, the composite model, denoted as SciBERT +
MHLP in Table 3, achieves the best performance among
all models, even beating the fine-tuned SciBERT. When
considering MHLP’s standalone performance, these
results showcase the potential improvements that could be
achieved through the use of structural information that
is not available in citation phrases. The presented
experiments are a stepping stone for better understanding
and using the structural information at scale for citation
intent classification.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Analysis</title>
      <sec id="sec-5-1">
        <title>6.1. Temporal Analysis</title>
        <p>This analysis studies the relationship between the time that has passed since publication and citation intent. We hypothesize that a paper is more likely to be cited as “Result” or “Method” right after its publication, and as time passes, it will be more likely to be cited as “Background.” If this is proven accurate, we could get a relatively strong signal from the temporal information for each citation. We plotted the years after publication against intent counts and ratios for all papers in the Semantic Scholar corpus to test our hypothesis. Figures 3a and 3b illustrate the results of our analysis. As evident from these figures and contrary to our original hypothesis, we find out that the ratio of intent classes almost stays the same as time passes, with insignificant fluctuations. As a result, using temporal information in our models is unlikely to provide any significant improvement. Note that these results are based on the weakly labeled links that we obtained from S2ORC. Consequently, these links are generated by another noisy model that could potentially be biased. Hence, it should not discourage further analysis or studies of temporal information for citation intent classification.</p>
        <p>Figure 4: The calculated MI values for (a) publication features (both sides) and (b) averaged neighborhood features (both sides). On average, the publication features show stronger connections to the target variable.</p>
        <p>6.2. Mutual Information Analysis. In this analysis, we study the quality of the engineered features as described in Section 4.3 concerning the weakly labeled intent classes. To this end, we use the well-known mutual information (MI) [12] measurement to quantify the importance of each feature. Formally, the MI between two random variables X and Y is defined as
I(X; Y) = Σ_x Σ_y p_{X,Y}(x, y) log( p_{X,Y}(x, y) / (p_X(x) p_Y(y)) ) (15)
where the sums range over the value spaces of X and Y, p_{X,Y} is the joint probability distribution, and p_X and p_Y are the marginal probability distributions. Note that MI is a non-negative value, and higher values indicate more correlation between the two random variables. For our analysis, we calculate MI for both sides of the 5,886 unique citation links in the SciCiteresplit dataset. Moreover, to study these features in the graph context, we also calculate MI for the average of these features over the neighborhood of each publication, i.e., all citing and cited publications, from both sides of the citation links. Figures 4a and 4b present the results of our experiments. As evident from these results, while the publication-averaged features generally show stronger connections to the target variable, the neighborhood-averaged features seem to show complementary connections, further emphasizing the importance of using both sets of features.</p>
        <p>Figure 5: The extracted features (a) before normalization and (b) after normalization.</p>
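        <p>Equation 15 for discrete variables can be computed directly from co-occurrence counts, as in the sketch below. The feature/label pairs are toy data, not the SciCiteresplit features; natural logarithm is assumed.</p>

```python
# Sketch of the discrete mutual information in Equation 15:
# I(X;Y) = sum over (x, y) of p(x, y) * log( p(x, y) / (p(x) p(y)) ),
# estimated from observed (feature, label) pairs. Toy data below.

import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X;Y) in nats from a list of observed (x, y) pairs."""
    n = len(pairs)
    pxy = Counter(pairs)              # joint counts
    px = Counter(x for x, _ in pairs)  # marginal counts of X
    py = Counter(y for _, y in pairs)  # marginal counts of Y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Perfectly dependent pairs give I = log(2); independent pairs give 0.
dep = [(0, "Background"), (0, "Background"), (1, "Method"), (1, "Method")]
ind = [(0, "Background"), (0, "Method"), (1, "Background"), (1, "Method")]
mi_dep, mi_ind = mutual_information(dep), mutual_information(ind)
```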
      </sec>
      <sec id="sec-5-2">
        <title>6.3. Feature Quality Analysis</title>
        <p>
          In this analysis, we study the effect of normalization as described in Equations 6 and 7. To this end, we project the extracted features of the 5,886 unique citation links in the SciCiteresplit dataset to a 2-dimensional space using t-SNE [
          <xref ref-type="bibr" rid="ref4">24</xref>
          ]. Figures 5a and 5b illustrate the projected space for the unnormalized and normalized features, respectively. As evident from Figure 5a, it is challenging to distinguish different intent types in the unnormalized space. However, after normalization, as evident from Figure 5b, we can see that the “Method” intention more or less creates a distinguishable cluster. This result shows that the use of normalization is potentially helpful for the model. Further studies on different types of normalization and their effects are left for future work.
        </p>
        <p>6.4. Robustness Analysis. In this analysis, we focus on studying the robustness of our proposed graph-based method. To this end, we devise two ablation studies. In the first study, we randomly corrupt a percentage of the weak labels by replacing the correct label with a random label. This study aims to better understand the model’s resilience to noise. In the second study, we randomly remove a percentage of the weak labels. The idea of this study is to better understand the effect of weak supervision on the model’s performance. These studies are carried out by running the MHLP method in the transductive setting on both the SciCiteorigin and SciCiteresplit datasets. The feature vectors for the publications are calculated by counting the number of citations and intents. These vectors are then normalized using Equations 6 and 7. To analyze the relationship between the model’s performance and the amount of available data, we create ten variations of the dataset by only using a portion of the available weak labels, varying from using all the available weak labels to only using 10% of them. Figure 6a presents the result of this study.</p>
        <p>7. Conclusion. In this work, we used the weak intent labels available in S2ORC to augment the extracted citation graph with citation intents and create a multi-relational knowledge graph. Following this, we adapted the sentence-based intent classification into a citation-based link prediction task on graphs. We then introduced a set of engineered graph-based and citation-based features. Built on top of these features, we introduced a graph-based multi-hop reasoning approach for the newly introduced task. Our approach achieves 62.16 and 59.81 macro F1 scores in the transductive and inductive settings, respectively. The experimental results in the inductive setting further showcase the robustness of the proposed approach in the information-deprived out-of-distribution environment. Compared to NLP-based models, we reached a comparable performance to, and in some cases outperformed, the word embedding-based methods that rely on contextual sentences to make predictions. Moreover, with a composite model comprising our method as the graph encoder and the state-of-the-art NLP-based model as the text encoder, we outperformed all the other models we experimented with. These results further signify the strong signal in relational information and highlight the importance of future analysis and studies in this domain. Finally, our presented analyses further support our methodological choices.</p>
        <p>
          As evident from Figure 6a, the more weakly labeled further support our methodological choices.
links are available, the better our method performs. The For future works, one straightforward idea is to extend
other significant observation is the robustness of the the knowledge graph with more scholarly information,
model, even in the extreme scenario of having access to such as authors, venues, and fields of study. There already
only 10% of the labels. Note that only 31.90% of links in exist some open repositories such as OpenAlex [15] and
the S2ROC have at least one weakly labeled intent, which Microsoft Academic Graph (MAG) [
          <xref ref-type="bibr" rid="ref6">25</xref>
          ] that contain this
means, even if the utilization percentage is 100%, only information. Another direction is further investigation
31.90% citation links are weakly labeled. into the temporal signals. Last but not least, although we
        </p>
        <p>Figure 6b showcases the relationship between the achieved an improved performance through a fusion of
model performance and the percentage of corrupted data. textual and structural information, more investigation
Following our intuition, the model’s performance mono- and analysis could be done in this setting in future works.
tonically decreases as we add more noisy labels to the
data. However, two interesting observations could be
made from this figure. First, the performance of our Acknowledgments
method only drops less than five macro F1 scores when
half (50%) of the weak labels are replaced with randomly This work was funded by the Defense Advanced Research
assigned noisy labels. This observation shows that the Projects Agency with award W911NF-19-20271 and with
proposed method is exceptionally resilient when faced support from a Keston Exploratory Research Award.
with mistakes. Second, even when all the labels are
replaced with random ones (100%), the model performs
better than the random baselines. This observation indi- References
cates that the model is learning to make inferences based
on purely structural information, which further solidifies
our hypothesis regarding the importance of structural
information.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions and Future Work</title>
      <sec id="sec-6-1">
        <title>In this work, we first introduced an expansion to the Sci</title>
        <p>Cite dataset by extracting scholarly information from the
S2ORC dataset and creating an extended citation graph.
Then, we gathered a large-scale weakly labeled dataset
plex Embeddings for Simple Link Prediction. In
Proceedings of the 33rd International Conference on
International Conference on Machine Learning -
Volume 48 (ICML’16). JMLR.org, New York, NY, USA,
2071–2080.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Identifying Meaningful Citations. In Scholarly Big</title>
        <p>Data: AI Perspectives, Challenges, and Ideas, Papers
from the 2015 AAAI Workshop (Technical Report,</p>
      </sec>
      <sec id="sec-6-3">
        <title>WS-15-13), Cornelia Caragea, C. Lee Giles, Narayan</title>
      </sec>
      <sec id="sec-6-4">
        <title>Bhamidipati, Doina Caragea, Sujatha Das Gollapalli,</title>
      </sec>
      <sec id="sec-6-5">
        <title>Saurabh Kataria, Huan Liu, and Feng Xia (Eds.).</title>
      </sec>
      <sec id="sec-6-6">
        <title>AAAI Press, Menlo Park, CA, 21–26. http://www.</title>
      </sec>
      <sec id="sec-6-7">
        <title>Visualizing Data using t-SNE. Journal of Machine</title>
        <p>http:</p>
      </sec>
      <sec id="sec-6-8">
        <title>Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia.</title>
      </sec>
      <sec id="sec-6-9">
        <title>Heterogeneous Contexts. In Proceedings of The Web</title>
        <p>ciation for Computing Machinery, New York, NY,
USA, 962–972.</p>
        <p>3380175
hao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and
of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval
(SIGIR ’20). Association for Computing Machinery,
New York, NY, USA, 739–748.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Hyperparameters</title>
      <sec id="sec-7-1">
        <title>A.1. Knowledge Graph Embedding</title>
        <p>We use a randomized search to tune our models and find near-optimal hyperparameters using the following ranges: negative samples ∈ {64, 128, 256, 512, 1024}, together with ranges over the embedding dimension and the learning rate, as well as the adversarial temperature and the margin value (RotatE only). Table 4 lists the selected hyperparameters of the KGE algorithms.</p>
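        <p>The randomized search described above can be sketched as follows. This is a minimal, dependency-free illustration, not the authors’ tuning code: the scoring function and any ranges not quoted in the text are placeholder assumptions, with only the negative-sample range taken from the paper.</p>

```python
import random

def random_search(score_fn, space, n_trials=20, seed=0):
    # Sample n_trials configurations uniformly at random from `space`
    # (a dict mapping hyperparameter name -> list of candidate values)
    # and keep the best-scoring one.
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical search space; only `negative_samples` is quoted in the text.
space = {
    "negative_samples": [64, 128, 256, 512, 1024],
    "embedding_dim": [100, 200, 400],      # assumption
    "learning_rate": [0.1, 0.01, 0.001],   # assumption
}

# Stand-in for the validation performance of a trained KGE model.
def toy_score(cfg):
    return cfg["negative_samples"] / cfg["embedding_dim"]

best_cfg, best_score = random_search(toy_score, space, n_trials=30)
```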
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Multilayer Perceptron</title>
        <sec id="sec-7-2-1">
          <p>To simplify the model tuning process, we find the optimal hyperparameters of “ComplEx + MLP” on SciCite<sub>origin</sub> using grid search and reuse them for the rest of our experiments. Specifically, we run a grid search over the following ranges: number of layers ∈ {0, 1, 2, 3}, dropout ∈ {0, 0.1, 0.2, 0.3, 0.4, 0.5}, and dimension ∈ {32, 64, 128}.</p>
        </sec>
        <sec id="sec-7-2-2">
          <p>The optimal hyperparameters are as follows: number of layers = 2, dropout = 0.2, and dimension = [64, 32]. We use ReLU as the activation function for all layers.</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>A.3. Multi-Hop Link Prediction</title>
        <sec id="sec-7-3-1">
          <p>We run a grid search over the number of layers, the hidden dimension, and the learning rate ∈ {0.03, 0.01, 0.003, 0.001}. The optimal hyperparameters are as follows: number of layers = 1, dimension = 100, and learning rate = 0.01. We use Adam as the optimizer throughout the tuning process.</p>
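          <p>The grid search above can be sketched generically. The scoring function and the candidate values for the number of layers and the dimension are illustrative assumptions; only the learning-rate range is taken from the text:</p>

```python
import itertools

def grid_search(score_fn, grid):
    # Exhaustively evaluate every combination in `grid` (a dict of
    # hyperparameter name -> list of candidate values) and return the
    # best-scoring configuration.
    names = sorted(grid)
    best_cfg, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        cfg = dict(zip(names, values))
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

grid = {
    "learning_rate": [0.03, 0.01, 0.003, 0.001],  # from the text
    "num_layers": [1, 2, 3],                      # assumption
    "dimension": [50, 100, 200],                  # assumption
}

# Stand-in score that peaks at the configuration reported as optimal.
def toy_score(cfg):
    return (-abs(cfg["learning_rate"] - 0.01)
            - abs(cfg["dimension"] - 100)
            - abs(cfg["num_layers"] - 1))

best_cfg, _ = grid_search(toy_score, grid)
```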
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Iz</given-names>
            <surname>Beltagy</surname>
          </string-name>
          , Kyle Lo, and
          <string-name>
            <given-names>Arman</given-names>
            <surname>Cohan</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SciBERT: A Pretrained Language Model for Scientific Text</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          .
          Association for Computational Linguistics, Hong Kong, China,
          <fpage>3615</fpage>
          -
          <lpage>3620</lpage>
          . https://doi.org/10.18653/v1/D19-1371
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          , Nicolas Usunier,
          <string-name>
            <given-names>Alberto</given-names>
            <surname>García-Durán</surname>
          </string-name>
          , Jason Weston, and
          <string-name>
            <given-names>Oksana</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          . Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Bouchard</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Complex Embeddings for Simple Link Prediction</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (ICML'16)</source>
          . JMLR.org, New York, NY, USA,
          <fpage>2071</fpage>
          -
          <lpage>2080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Marco</given-names>
            <surname>Valenzuela</surname>
          </string-name>
          , Vu Ha, and
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Identifying Meaningful Citations</article-title>
          .
          <source>In Scholarly Big Data: AI Perspectives, Challenges, and Ideas, Papers from the 2015 AAAI Workshop (Technical Report WS-15-13)</source>
          , Cornelia Caragea, C. Lee Giles, Narayan Bhamidipati, Doina Caragea, Sujatha Das Gollapalli, Saurabh Kataria, Huan Liu, and Feng Xia (Eds.). AAAI Press, Menlo Park, CA,
          <fpage>21</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Laurens</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Visualizing Data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          ,
          <issue>86</issue>
          (
          <year>2008</year>
          ),
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Kuansan</given-names>
            <surname>Wang</surname>
          </string-name>
          , Zhihong Shen,
          <string-name>
            <given-names>Chiyuan</given-names>
            <surname>Huang</surname>
          </string-name>
          , Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia.
          <year>2020</year>
          .
          <article-title>Microsoft academic graph: When experts are not enough</article-title>
          .
          <source>Quantitative Science Studies</source>
          <volume>1</volume>
          ,
          <issue>1</issue>
          (
          <year>2020</year>
          ),
          <fpage>396</fpage>
          -
          <lpage>413</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Wenhao</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mengxia</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tong</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Meng</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Identifying Referential Intention with Heterogeneous Contexts</article-title>
          .
          <source>In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW '20)</source>
          . Association for Computing Machinery, New York, NY, USA,
          <fpage>962</fpage>
          -
          <lpage>972</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          Da Zheng, Xiang Song, Chao Ma, Zeyuan Tan, Zihao Ye, Jin Dong, Hao Xiong, Zheng Zhang, and George Karypis.
          <year>2020</year>
          .
          <article-title>DGL-KE: Training Knowledge Graph Embeddings at Scale</article-title>
          .
          <source>In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20)</source>
          . Association for Computing Machinery, New York, NY, USA,
          <fpage>739</fpage>
          -
          <lpage>748</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>