
Learning Contextual Representations of Citations via Graph Transformer ⋆

Hyeon-Ju Jeon

hjjeon@kiaps.org 1 3

Gyu-Sik Choi

Se-Young Cho

Hanbin Lee

Hee Yeon Ko

hyeonju@cau.ac.kr myeongyeon.yi@navercorp.com 6

Jason J. Jung

O-Joun Lee

Myeong-Yeon Yi

4
0 Catholic University of Korea, Bucheon-si, Gyeonggi-do, Korea
1 Chung-Ang University, Dongjak-gu, Seoul, Korea
2 Incheon National University, Yeonsu-gu, Incheon, Korea
3 Korea Institute of Atmospheric Prediction Systems, Dongjak-gu, Seoul, Korea
4 NAVER Corp., Seongnam-si, Gyeonggi-do, Korea
5 Sogang University, Mapo-gu, Seoul, Korea
6 Soongsil University, Dongjak-gu, Seoul, Korea


This study aims to represent citations based on the citation context extracted from the citation network. Researchers cite papers for various purposes in order to lay out their arguments in a logical structure. Thus, citations play different roles depending on where in the structure of the paper they appear. In this paper, we first present a definition of the citation context and initialize the embedding vector based on the citation order and location. Then, based on the graph transformer model, we learn contextual citation embeddings. To represent the citation context, we consider the following three parts: (i) textual features of the paper, (ii) positional features of the citation context, and (iii) structural features of the citation network, by applying the self-attention mechanism.

Citation Context • Citation Network • Graph Transformer • Network Embedding • Positional Embedding

1 Introduction

The exponential growth in the number of academic papers has given rise to various services (e.g., citation recommendation [3, 7, 13], bibliographical retrieval [15], and so on). Such services require fine-grained analysis of the scientific impact and content of papers [9].

There have been various studies [14, 16] on citation analysis to assess the quality of a paper and understand its context. These studies have mostly applied citation frequency-based and content-based approaches. The frequency-based approaches assign the same weight to every citation regardless of its purpose. As shown in Fig. 1, when two papers $p_1$ and $p_2$ are cited by $p_i$, suppose that $p_1$ is cited in the introduction section and $p_2$ in the evaluation section. In this case, $p_1$ and $p_2$ are cited for different purposes, and their importance also differs.

To solve this problem, it is necessary to understand the overall context of each citation in the paper. The content-based approaches [6, 19] attempted to learn the contextual features of the paper using a language model based on RNN/LSTM. Nevertheless, these studies concentrated on measuring the content similarity between two papers rather than on discovering the citation context or the roles of citations.

Therefore, in this paper, we define and extract the citation context in citation networks. First of all, we assume that the cited papers compose the contents of the citing paper, and that the order and location of the cited papers reflect the role of each paper in the citing paper. To represent citations, we propose an embedding method considering (i) textual features of the paper, (ii) positional features of the citation context, and (iii) structural features of the citation network, by applying the self-attention mechanism [18]. The proposed method can represent global citation features using fewer layers than the conventional GCN model. It is also efficient at learning the context of long papers.

Finally, based on the graph-transformer [20], the proposed method generates pre-trained citation vectors that consider the influence and correlation between cited papers. In the future, this result can be used in various tasks such as citation classification, research topic discovery, and paper evaluation.

2 Related work

This section introduces the existing methods for analyzing the citation relationship in the citation network. To deal with the large citation network, various studies investigated the co-citation frequency.

Boyack and Klavans [2] focused on network theory, which can measure node importance and weight, to analyze co-citation relationships and bibliographic coupling. Although this approach reflects features at the network structure level, it hardly accounts for the different roles of citations. To solve this problem, Habib and Afzal [5] exploited the distribution of citations across sections to capture the citation context. Nevertheless, it is necessary to analyze the distinguishing characteristics of co-cited papers at the content level. Proximity-based methods [4, 12] were proposed for weighting the edges of the co-citation network by using contexts. The edge weight was based on the strength of the co-citation context at the sentence level. Also, Ahmad and Afzal [1] showed that traditional co-citation analysis can produce better results when combined with metadata of the paper (e.g., author, affiliation, venue, and so on).

The above approaches focused on comparing content-based similarities in consideration of the relationship between cited papers. While these are effective for specific tasks such as citation recommendation and search, it is difficult for them to generate widely usable representations by unsupervised learning. Therefore, a few studies conducted network representation learning [11] to embed paper nodes based on the citation context at the network structure level. VOPRec [10] learned vector representations of papers by combining text information with structural identity in the citation network. DocCit2Vec [21], which represents papers based on the citation context at the document level, is used for recommendation systems by applying the attention mechanism.

However, it is difficult for these methods to consider the contextual features reflected in the structure of papers. From this perspective, we extract the context of a citation through citation networks constructed according to the citation section. After that, initial embedding is performed considering the network structure and textual features so that the transformer model can learn various features of citations.

In this section, we introduce the contextual citation embedding model in detail. As illustrated in Fig. 2, the model comprises three components: (1) extracting the citation context, (2) initializing the citation embedding, and (3) a graph-transformer based encoder. The graph transformer model thereby learns a representation of a target citation by fusing the input initial embedding vectors. To extract the context of the citation in the first component, we define our citation network as follows.

Definition 1 (Citation Network). The citation network ($\mathcal{N}$) contains paper nodes ($P$). There are citation relationships ($C \in \mathbb{R}^{|P| \times |P|}$) between paper nodes. When paper $p_i$ cites paper $p_j$ in the $n$th section, the citation relationship has weight $w \in \{0, \cdots, n, \cdots, N\}$. This can be formulated as follows:

$$\mathcal{N} = \langle P, C, w, t \rangle, \tag{1}$$

where $t$ refers to a textual feature vector of $P$.

To account for the different section compositions of papers, we rearrange each paper into four sections numbered from 0 to 3: 0 represents the introduction, 1 the related work, 2 the methodology, and 3 the results. In this case, the maximum section number is 3. Also, for the text features of each paper $p_i$, different word embedding models can be used.

3.1 Extracting citation context

Instead of working on the entire citation network $\mathcal{N}$, we extract the citation context from the citation network. Existing network embedding methods use a node sampling approach that is weighted according to the importance of the node. However, since the importance of cited papers is determined by the purpose of citations, an analysis of the purpose and characteristics of each cited paper is necessary.

As stated in Sect. 1, we assume that the citation order and location in the paper relate to the purpose of the citation. Thus, we extract various subgraphs for the target paper by sampling the cited papers for each section rather than sampling over all cited papers at once. In this section, we define these subgraphs as the citation context.

Definition 2 (Citation Context). Given an input citation network $\mathcal{N}$, for paper $p_i$ in the network, the citation context is a set of sampled papers at each section $n \in [0, N]$. This can be formulated as follows:

$$\Gamma(p_i) = \langle \Gamma(p_{i,0}), \cdots, \Gamma(p_{i,N}) \rangle, \tag{2}$$

where $\Gamma(p_{i,n})$ represents the contextual citations in section $n$. This can be formulated as follows:

$$\Gamma(p_{i,n}) = \{ p_j \mid p_j \in P \setminus \{p_i\} \wedge w(i, j) = n \}. \tag{3}$$
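To make Definitions 1 and 2 concrete, the following minimal sketch stores a toy citation network as a mapping from (citing, cited) pairs to section indices and groups the cited papers of a target by section. The paper identifiers, the dictionary layout, and the function name `citation_context` are illustrative assumptions, and no sampling is performed here.

```python
# Hypothetical toy citation network: (citing, cited) -> section index w(i, j).
# Section indices follow the four-section scheme described in the text:
# 0 introduction, 1 related work, 2 methodology, 3 results.
CITATIONS = {
    ("p0", "p1"): 0,
    ("p0", "p2"): 0,
    ("p0", "p3"): 2,
    ("p0", "p4"): 3,
    ("p1", "p3"): 1,
}
MAX_SECTION = 3  # the maximum section number N


def citation_context(citations, target, max_section=MAX_SECTION):
    """Group the papers cited by `target` by the section in which they are cited.

    Returns a list whose n-th entry is the set of papers cited in section n.
    """
    context = [set() for _ in range(max_section + 1)]
    for (citing, cited), section in citations.items():
        if citing == target and cited != target:
            context[section].add(cited)
    return context


context_p0 = citation_context(CITATIONS, "p0")
print(context_p0)
```

In this toy network, `p0` cites two papers in its introduction, none in its related work, and one each in its methodology and results sections, so the per-section sets differ in size.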

To efficiently extract the citation context for a batch of papers during the training of the embedding model, we extend a node sampling algorithm to enable node sampling for each section. The sampling method iteratively samples a list of papers for a target paper $p_i$ using an adaptive sampling depth $K_n$ for each section. Let $S^{k-1}_{p_i,n}$ refer to the bag of papers sampled at the $(k-1)$th step in the $n$th section. For each paper node $p_j$ in $S^{k-1}_{p_i,n}$, we randomly sample cited papers in the citation network with replacement from $p_j$'s one-hop neighbors at the $k$th step. Through this process, the papers in $p_i$'s citation context $\Gamma(p_i)$ can cover both local neighbors of $p_i$ and papers far away.
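The section-wise sampling described above could be sketched as follows. This is only one possible reading of the procedure: the adjacency layout, the fixed sample width per step, and the choice to seed the first step with the papers the target cites in the given section are assumptions made for illustration.

```python
import random

# Hypothetical adjacency: citing paper -> {section index: [cited papers]}.
CITED_BY_SECTION = {
    "p0": {0: ["p1", "p2"], 2: ["p3"]},
    "p1": {1: ["p3", "p4"]},
    "p2": {0: ["p5"]},
    "p3": {},
    "p4": {},
    "p5": {},
}


def one_hop(graph, paper):
    """All papers cited by `paper`, regardless of section."""
    return [q for cited in graph.get(paper, {}).values() for q in cited]


def sample_context(graph, target, depth_by_section, width=2, rng=None):
    """Sample a bag of papers per section, using a section-specific depth K_n."""
    rng = rng or random.Random(0)
    context = {}
    for section, depth in depth_by_section.items():
        # Assumed first step: sample from the papers the target cites in this section.
        seeds = graph.get(target, {}).get(section, [])
        frontier = rng.choices(seeds, k=width) if seeds else []
        sampled = list(frontier)
        # Later steps: sample with replacement from each paper's one-hop neighbors.
        for _ in range(depth - 1):
            next_frontier = []
            for paper in frontier:
                neighbors = one_hop(graph, paper)
                if neighbors:
                    next_frontier.extend(rng.choices(neighbors, k=width))
            sampled.extend(next_frontier)
            frontier = next_frontier
        context[section] = sampled
    return context


context = sample_context(CITED_BY_SECTION, "p0", depth_by_section={0: 2, 1: 1, 2: 1})
print(context)
```

Because `random.Random.choices` samples with replacement, a paper may appear more than once in a bag, which matches the "bag of papers" wording; a deeper `depth_by_section` entry lets a section reach papers further from the target.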

3.2 Initializing the citation embedding

Based on the citation context concept, we obtain the set of sampled subgraph batches for all nodes, $G = \{g_1, g_2, \cdots, g_{|P|}\}$, where $g_i$ represents the subgraph sampled for target paper $p_i$. Unlike a general graph, in which nodes are unordered, the cited papers in a paper are logically arranged, so the order of citations is meaningful. Therefore, the citation context is serialized in the order cited in the paper. Formally, we concatenate the target paper $p_i$ and its ordered contextual citations in $g_i$, denoted by $I_{p_i} = [p_i, p_{i,1}, p_{i,2}, \cdots, p_{i,S}]$, where $p_{i,j}$ is the $j$th node in $g_i$ and $1 \le j \le S$. In this section, we define paper embeddings along the order in which citations appear in a paper. The paper embeddings will be the input to the graph-transformer model.

For textual embedding, we can embed the textual feature vector $t_j$ into a shared feature space for each paper $p_j \in I_{p_i}$ in the citation context $g_i$. Simple fully connected layers can be used for the textual input. This can be formulated as follows:

$$x_{text}(p_j) = \mathrm{Embedding}(t_j) \in \mathbb{R}^{d}, \tag{4}$$

where $d$ indicates the dimension of the shared feature space.

The position of a paper in the citation context $I_{p_i}$ reflects the purpose and characteristic of the citation to the target paper $p_i$. Thus, we suggest that the order of papers in $I_{p_i}$ is significant in learning citation representations. The following position-id embedding is used to identify the cited paper order information of an input list,

$$x_{pos}(p_j) = \mathrm{Embedding}[p(j)] \in \mathbb{R}^{d}, \tag{5}$$

where $p(j)$ indicates the position-id of paper $p_j$ in $I_{p_i}$.

Our main objective is to obtain the representation of the target paper $p_i$ based on the structural roles. To identify the role of each paper, we use the embedding method based on the Weisfeiler-Lehman (WL) algorithm [17]. This can be formulated as follows:

$$x_{role}(p_j) = \mathrm{Embedding}[r(j)] \in \mathbb{R}^{d}, \tag{6}$$

where $r(j)$ refers to the role label.

After computing the three terms of embedding, we aggregate them to be the initial input paper embedding of the graph transformer model. The embedding fusion is formalized as follows:

$$x(p_j) = x_{text}(p_j) + x_{pos}(p_j) + x_{role}(p_j) \in \mathbb{R}^{d}. \tag{7}$$

We define the embedding fusion function as the summation of three embedding terms.
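As a rough illustration of the embedding fusion, the sketch below computes the textual, positional, and role embeddings for a serialized citation list and sums them. All sizes, the randomly initialised parameters standing in for learned ones, and the role labels (assumed to come from a separate WL relabelling step that is not shown) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8           # dimension of the shared feature space
text_dim = 16   # hypothetical size of the raw textual feature vector t_j
max_len = 5     # hypothetical length of the serialized list (target + 4 cited papers)
num_roles = 4   # hypothetical number of distinct role labels

# Randomly initialised parameters standing in for learned ones.
W_text = rng.normal(size=(text_dim, d))   # single fully connected layer for text
E_pos = rng.normal(size=(max_len, d))     # lookup table indexed by position-id
E_role = rng.normal(size=(num_roles, d))  # lookup table indexed by role label


def initial_embedding(t, position_ids, role_ids):
    """Sum the textual, positional, and role embeddings of each paper."""
    x_text = t @ W_text        # (S, d)
    x_pos = E_pos[position_ids]    # (S, d)
    x_role = E_role[role_ids]      # (S, d)
    return x_text + x_pos + x_role


t = rng.normal(size=(max_len, text_dim))  # toy textual features, one row per paper
position_ids = np.arange(max_len)         # citation order within the list
role_ids = np.array([0, 1, 1, 2, 3])      # assumed output of a WL relabelling step
X = initial_embedding(t, position_ids, role_ids)
print(X.shape)  # (5, 8)
```

Summation keeps the fused vector in the same $d$-dimensional space as each term, so the three lookup results must share that dimension.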

Finally, given a target paper $p_i$, we obtain the initial embedding of each paper in its sampled set of cited papers. The initial embeddings of the papers in the citation context $I_{p_i}$ can be stacked into an embedding matrix, represented by $X(p_i) = [x(p_1), x(p_2), \cdots, x(p_S)] \in \mathbb{R}^{S \times d}$.

3.3 Graph-transformer based encoder

The goal of the graph-transformer model is to aggregate the initial embedding of each paper and generate a low-dimensional embedding vector for each paper. A number of attention layers are stacked to compose the transformer module. A single layer can be formulated as:

$$H^{(l)} = \mathrm{attention}\left(H^{(l-1)}\right) = \mathrm{softmax}\left(\frac{Q^{(l)} {K^{(l)}}^{\top}}{\sqrt{d}}\right) V^{(l)}, \tag{8}$$

where $H^{(l)}$ and $H^{(l-1)}$ denote the output embeddings of the $l$th and $(l-1)$th layers, $Q^{(l)}$, $K^{(l)}$, and $V^{(l)}$ are the query, key, and value matrices, respectively, and $d$ is the dimension of the paper embedding. Specifically, $Q^{(l)}$, $K^{(l)}$, and $V^{(l)}$ are calculated as follows:

$$Q^{(l)} = H^{(l-1)} W_Q^{(l)}, \quad K^{(l)} = H^{(l-1)} W_K^{(l)}, \quad V^{(l)} = H^{(l-1)} W_V^{(l)}, \tag{9}$$

where $W_Q^{(l)}$, $W_K^{(l)}$, and $W_V^{(l)}$ are the weight matrices of the $l$th attention layer.
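A single-head version of this attention layer can be sketched in a few lines of NumPy. The matrix sizes, the random weights, and the number of stacked layers are placeholders chosen for illustration; components that the formula above does not mention (multiple heads, residual connections, normalization) are deliberately left out.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention_layer(H_prev, W_q, W_k, W_v):
    """One attention layer: project H_prev to Q, K, V and apply scaled dot-product attention."""
    Q, K, V = H_prev @ W_q, H_prev @ W_k, H_prev @ W_v
    d = H_prev.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


rng = np.random.default_rng(0)
S, d, num_layers = 5, 8, 2      # hypothetical list length, dimension, and depth
H = rng.normal(size=(S, d))     # stands in for the initial embedding matrix
for _ in range(num_layers):
    # Random weights standing in for the learned matrices of each layer.
    W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    H = attention_layer(H, W_q, W_k, W_v)
Z = H                           # output of the last layer
print(Z.shape)  # (5, 8)
```

Each row of the softmax output is a distribution over all papers in the list, so every paper can attend to every other paper within a single layer.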

The input of the graph-transformer model, $H^{(0)}$, is the embedding matrix of the target paper, $X(p_i)$. The output of the last attention layer, $H^{(L)}$, is defined as the output paper embedding matrix $Z$ of the transformer model.

4 Conclusion and future work

In this paper, we have proposed a method for learning representations of a contextual citation network. We have defined the citation context by sampling a different number of papers per section. Using a graph transformer model, paper vectors are produced based on salient citations within the citation context. According to our initial assumption, the results of the embedding model can reflect the role of each citation in the paper.

The citation purpose of a paper can change dynamically [8]. As future work, we can represent the paper with the meaning of citations that change over time. In addition, various bibliographic entities such as highly reputed journals and authors affect citations. If the graph transformer model is extended to heterogeneous networks in the future, rich interactions between bibliographic entities can be analyzed. Finally, we intend to evaluate the proposed embedding model on a large contextual citation network.

Acknowledgements This work was supported by the Korea Foundation for Women In Science, Engineering and Technology (WISET) grant funded by the Ministry of Science and ICT (MSIT) under the team research program for female engineering students (WISET Contract No. 2021-178).

References

1. Ahmad, S., Afzal, M.T.: Combining co-citation and metadata for recommending more related papers. In: 2017 International Conference on Frontiers of Information Technology (FIT 2017). pp. 218-222. IEEE (Dec 2017). https://doi.org/10.1109/fit.2017.00046

2. Boyack, K.W., Klavans, R.: Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology 61(12), 2389-2404 (Dec 2010). https://doi.org/10.1002/asi.21419

3. Cai, X., Zheng, Y., Yang, L., Dai, T., Guo, L.: Bibliographic network representation based personalized citation recommendation. IEEE Access 7, 457-467 (Dec 2019). https://doi.org/10.1109/access.2018.2885507

4. Eto, M.: Extended co-citation search: Graph-based document retrieval on a co-citation network containing citation context information. Information Processing & Management 56(6), 102046 (Nov 2019). https://doi.org/10.1016/j.ipm.2019.05.007

5. Habib, R., Afzal, M.T.: Sections-based bibliographic coupling for research paper recommendation. Scientometrics 119(2), 643-656 (Mar 2019). https://doi.org/10.1007/s11192-019-03053-8

6. Huang, W., Kataria, S., Caragea, C., Mitra, P., Giles, C.L., Rokach, L.: Recommending citations: translating papers into references. In: Chen, X., Lebanon, G., Wang, H., Zaki, M.J. (eds.) Proceedings of the 21st ACM international conference on Information and knowledge management (CIKM 2012). pp. 1910-1914. ACM Press, Maui, HI, USA (Oct 2012). https://doi.org/10.1145/2396761.2398542

7. Huang, W., Wu, Z., Liang, C., Mitra, P., Giles, C.L.: A neural probabilistic model for context based citation recommendation. In: Bonet, B., Koenig, S. (eds.) Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI 2015). pp. 2404-2410. AAAI Press, Austin, Texas, USA (Jan 2015)

8. Jeon, H.J., Jung, J.J.: Discovering the role model of authors by embedding research history. Journal of Information Science 0(0), 01655515211034407 (2021). https://doi.org/10.1177/01655515211034407

9. Jeon, H.J., Lee, O.J., Jung, J.J.: Is performance of scholars correlated to their research collaboration patterns? Frontiers in Big Data 2(39) (Nov 2019). https://doi.org/10.3389/fdata.2019.00039

10. Kong, X., Mao, M., Wang, W., Liu, J., Xu, B.: VOPRec: Vector representation learning of papers with text information and structural identity for recommendation. IEEE Transactions on Emerging Topics in Computing 9(1), 226-237 (Jan 2021). https://doi.org/10.1109/tetc.2018.2830698

11. Lee, O.J., Jeon, H.J., Jung, J.J.: Learning multi-resolution representations of research patterns in bibliographic networks. Journal of Informetrics 15(1), 101126 (Feb 2021). https://doi.org/10.1016/j.joi.2020.101126

12. Liu, S., Chen, C.: The proximity of co-citation. Scientometrics 91(2), 495-511 (Dec 2011). https://doi.org/10.1007/s11192-011-0575-7

13. Ma, S., Zhang, C., Liu, X.: A review of citation recommendation: from textual content to enriched context. Scientometrics 122(3), 1445-1472 (Jan 2020). https://doi.org/10.1007/s11192-019-03336-0

14. MacRoberts, M.H., MacRoberts, B.R.: The mismeasure of science: Citation analysis. Journal of the Association for Information Science and Technology 69(3), 474-482 (Nov 2017). https://doi.org/10.1002/asi.23970

15. Raamkumar, A.S., Foo, S., Pang, N.: Using author-specified keywords in building an initial reading list of research papers in scientific paper retrieval and recommender systems. Information Processing & Management 53(3), 577-594 (May 2017). https://doi.org/10.1016/j.ipm.2016.12.006

16. Roman, M., Shahid, A., Uddin, M.I., Hua, Q., Maqsood, S.: Exploiting contextual word embedding of authorship and title of articles for discovering citation intent classification. Complexity 2021, 5554874:1-5554874:13 (Apr 2021). https://doi.org/10.1155/2021/5554874

17. Shervashidze, N., Schweitzer, P., van Leeuwen, E.J., Mehlhorn, K., Borgwardt, K.M.: Weisfeiler-lehman graph kernels. Journal of Machine Learning Research 12, 2539-2561 (Sep 2011)

18. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30th Annual Conference on Neural Information Processing Systems (NIPS 2017). pp. 5998-6008. Long Beach, CA, USA (Dec 2017), https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

19. Wang, J., Zhu, L., Dai, T., Wang, Y.: Deep memory network with bi-LSTM for personalized context-aware citation recommendation. Neurocomputing 410, 103-113 (Oct 2020). https://doi.org/10.1016/j.neucom.2020.05.047

20. Zhang, J., Zhang, H., Xia, C., Sun, L.: Graph-bert: Only attention is needed for learning graph representations (2020), https://arxiv.org/abs/2001.05140

21. Zhang, Y., Ma, Q.: Citation recommendations considering content and structural context embedding. In: Lee, W., Chen, L., Moon, Y., Bourgeois, J., Bennis, M., Li, Y., Ha, Y., Kwon, H., Cuzzocrea, A. (eds.) 2020 IEEE International Conference on Big Data and Smart Computing (BigComp 2020). pp. 1-7. IEEE, Busan, Korea (South) (Feb 2020). https://doi.org/10.1109/bigcomp48618.2020.0-109