
An interactive visual tool for scientific literature search: Proposal and algorithmic specification

Juan Pablo Bascur

j.p.bascur.cifuentes@cwts.leidenuniv.nl

Nees Jan van Eck

Ludo Waltman

Centre for Science and Technology Studies, Leiden University, the Netherlands

2019

Pages 76-87

Literature search is a critical step in scientific research. Most of the current literature search tools present the search results as a list of documents. These tools fail to show the structure of the search results. To address this issue, we propose an interactive visual tool for searching scientific literature. This tool creates, labels and visualizes clusters of documents that may be of relevance to the user. In this way, it provides the user with an overview of the structure of the search results. This overview is intended to be understandable even to a user who has only a limited familiarity with the scientific domain of interest. We present the concept of our tool, show a case study of its use and describe the technical specifications of the tool. In particular, we provide a detailed specification of the algorithm that we use to visualize clusters of documents.

Keywords: Scientific literature search, Scatter/gather, Packed bubble chart

1 Introduction

Literature search is an essential part of any research project. Many of the current literature search tools (e.g. Google Scholar [1], Web of Science [2], Scopus [3] and Dimensions [4]) present the search results as a list of documents, without showing the structure of the results. Understanding this structure, for instance through a breakdown of the search results into different research topics, can be useful for exploring the literature [5], especially for making serendipitous discoveries or for users who are new to a field of research.

Some prior work has studied the idea of showing the structure of search results. An example is the recent work on a tool called PaperPoles [6], which uses citation links to create clusters of related papers. Various tools have also been made publicly available, some of them with a clear focus on literature search and others with a primary focus on bibliometric analysis. For instance, CiteSpace [7], CitNetExplorer [8] and Citation Gecko [9] can be used to visualize networks of citations between documents. Open Knowledge Maps [10] shows clusters of semantically related papers. VOSviewer [11] presents visualizations of co-occurrence networks derived from papers (e.g. co-authorship links between authors, citation links between documents, or co-occurrence links between terms).

While these tools are helpful, some of them (e.g. CiteSpace, VOSviewer) were developed primarily for bibliometric analysis, not for literature search. Others (e.g. CitNetExplorer, Citation Gecko) have the limitation of showing search results only at the level of individual papers, not at aggregate levels. To overcome the limitations of currently available tools, we propose a new tool for literature search. This tool uses an interactive visual interface to show the structure of the search results. We make use of ideas and techniques that we also used in the development of other tools (i.e., VOSviewer and CitNetExplorer), but we now focus specifically on literature search rather than on bibliometric analysis. To some degree, the proposed tool resembles Open Knowledge Maps. However, by relying on the scatter/gather approach [12], the tool offers a higher level of interactivity, which facilitates the exploration of large document spaces.

This paper is divided into three parts. We first describe the proposed tool (Section 2), then present a case study demonstrating its use (Section 3), and finally give a technical specification of the algorithms included in the tool (Section 4).

2 Description of the tool

Our proposed tool is based on the scatter/gather approach [12]. This approach consists of exploring a set of documents through multiple iterations of scattering and gathering. To scatter means creating clusters of documents and labeling them to understand their contents. To gather means selecting the clusters of interest, resulting in a new set of documents (Fig. 1). The documents in our tool are scientific papers.
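The scatter/gather loop can be illustrated with a few lines of Python. This is only a minimal sketch and not the implementation of the tool: the function names `scatter` and `gather` and the toy `topic` mapping are assumptions made for the example, and the tool itself derives clusters from citation links rather than from a fixed mapping.

```python
from collections import defaultdict

def scatter(papers, cluster_of):
    """Group papers into clusters; cluster_of maps a paper to a cluster id (hypothetical)."""
    clusters = defaultdict(list)
    for paper in papers:
        clusters[cluster_of(paper)].append(paper)
    return dict(clusters)

def gather(clusters, selected_ids):
    """Merge the papers of the selected clusters into a new working set."""
    return [paper for cid in selected_ids for paper in clusters[cid]]

# Toy data (assumed): paper id -> topic produced by some clustering step.
topic = {"p1": "peer review", "p2": "peer review", "p3": "h-index", "p4": "altmetrics"}

clusters = scatter(topic.keys(), topic.get)      # scatter: create clusters
working_set = gather(clusters, ["peer review"])  # gather: keep clusters of interest
```

The gathered `working_set` would be the input of the next scattering step, which is what makes the approach iterative.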

Our tool scatters a set of papers into clusters. The clustering uses the citation links between papers. Each cluster is given a label. The label of a cluster consists of the ten noun phrases with the highest weighted frequency in the titles and abstracts of the papers in the cluster. The weighting considers the frequency of occurrence of the noun phrases in the focal cluster relative to other clusters. This clustering and labeling method is based on Waltman and Van Eck [13].

Our tool also visualizes the clusters to complement the labels. It visualizes the clusters as bubbles in a packed bubble chart. The size of the bubbles reflects the number of papers in the clusters and the distance between the bubbles approximately reflects the number of citation links between the clusters.

Our tool supports multiple iterations of scattering and gathering. The user can load the initial set of papers, choose the clusters to gather, choose the number of clusters to scatter, retrieve the papers in the clusters, and so on.

3 Case study of the tool

3.1 Set up

First, let us consider a user working with a traditional search engine for scientific literature, like Google Scholar. She has to come up with several search queries. She does not have a background in the academic field that she is looking into, so she will probably not come up with good queries. She also has no way of knowing whether she is missing important papers or even entire subfields.

Second, let us assume instead that she uses a literature search engine that offers some very basic features for exploring the structure of the search results, like Web of Science. She can now see to which academic fields her search results belong. Despite this, she still has essentially the same problems as with Google Scholar.

Third, let us assume that she uses our proposed tool for her literature search. For this example, we will follow her through all the steps of the search process. We will assume that she is interested in getting to know the scientific literature about the review process of grant proposals. As the initial set of papers, we use the cluster of scientometrics papers obtained using the algorithmic methodology employed at CWTS [13]. We believe that she would have used the same set because it covers her topic.

3.2 Example of the search process

The researcher retrieves the set of papers and chooses a value of 10 for the number of clusters in the first scattering. Then she sees the visualization (Fig. 2A) and the labels (Table 1) of the clusters. From the labels, she sees that her topic of interest is in cluster 6. She also checks the labels of the clusters close to cluster 6 (clusters 0, 3, 5, 8 and 9). Their labels indicate that they do not relate to her topic of interest, so she only gathers cluster 6.

She chooses to have 5 clusters for the second scattering and sees the visualization (Fig. 2B) and the labels (Table 2) of the clusters. Now the labels are more ambiguous, so she also has to read the titles of the papers inside the clusters to understand what the clusters are about. She suspects that her topic of interest is in clusters 1 and 2. From the visualization and the labels, she also sees that her topic could be in cluster 4. She reads the titles of the 5 most cited papers in these three clusters (Tables 3, 4 and 5). She finally decides that she should start reading paper 3 from cluster 1 and papers 2 and 4 from cluster 2.

In this example, we have illustrated how our tool could improve scientific literature search. The key advantage of the proposed tool is that the user is informed about the way in which the scientific literature is organized. For instance, the user is able to see how a field is divided into subfields or topics. As a result of this, the user is able to discard papers unrelated to the topic of interest without the need to skim the titles of large numbers of individual papers. Instead, the user examines the labels of clusters and then decides to discard entire clusters that appear to be of no relevance. Also, the user does not need to try to come up with a detailed keyword query that identifies exactly the right papers. It is sufficient to be able to identify a broad set of papers that could potentially be of relevance. Within this broad set of papers, the papers of interest can then be found by drilling down into the right clusters.

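[Table 1: number of papers in clusters 0 to 9 of the first scattering: 4344, 3154, 1652, 1651, 1231, 1230, 932, 816, 810, 492]

[Table 2: number of papers in clusters 0 to 4 of the second scattering: 387, 270, 104, 104, 67]

[Tables 3 to 5: the five most cited papers in clusters 1, 2 and 4 of the second scattering]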

4 Technical specifications

4.1 Clustering the documents

We cluster the papers by applying the Leiden algorithm to their citation links [13, 14]. The Leiden algorithm identifies clusters (or communities) of nodes within a network. We apply it to a directed network in which the papers are the nodes and the citations from citing to cited papers are the edges. The Leiden algorithm has a resolution parameter that determines the number and size of the clusters. To avoid requiring the user to set the resolution parameter manually, we developed a rule of thumb that enables the user to specify the desired number of clusters C. According to this rule, the resolution parameter is chosen in such a way that the largest cluster includes between N/C and N/(C-2) papers, where N is the total number of papers in the collection. To obtain the desired number of clusters after the clustering algorithm has been run, we keep the C largest clusters and merge the remaining, smaller clusters into them. We merge the pairs of clusters that have the highest relatedness, which we define as e(c1,c2)/(n(c1)*n(c2)), where c1 and c2 are the clusters, e(c1,c2) is the number of edges between the two clusters and n(c) is the number of papers in cluster c.

4.2 Labeling the clusters

We label clusters using the approach developed by Waltman and Van Eck [13]. This approach extracts cluster labels from noun phrases in the titles and abstracts of the papers belonging to a cluster. It labels a cluster using noun phrases that are common in the cluster and relatively uncommon in other clusters. The only modification that we make to the approach introduced in [13] is that we report 10 noun phrases instead of 5.
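The merging of small clusters by relatedness can be sketched as follows. This is one possible reading of the procedure and not the tool's code: we assume that, one at a time, the pair of a small cluster and one of the C largest clusters with the highest relatedness e(c1,c2)/(n(c1)*n(c2)) is merged, and that sizes and link counts are recomputed after every merge. The function name `merge_to_c_clusters` and the input format are hypothetical, and C is assumed to be at least 1.

```python
from collections import Counter
from itertools import product

def merge_to_c_clusters(membership, citations, c):
    """Reduce a clustering to c clusters by merging small clusters into large ones.

    membership: dict paper -> cluster id; citations: list of (citing, cited) pairs.
    """
    membership = dict(membership)
    while True:
        sizes = Counter(membership.values())  # n(c) for every cluster
        if len(sizes) <= c:
            return membership
        # Keep the c largest clusters; every other cluster is a merge candidate.
        kept = [cid for cid, _ in sizes.most_common(c)]
        small = [cid for cid in sizes if cid not in kept]
        # e(c1, c2): number of citation links between two different clusters.
        links = Counter()
        for citing, cited in citations:
            a, b = membership[citing], membership[cited]
            if a != b:
                links[frozenset((a, b))] += 1

        def relatedness(pair):
            s, k = pair
            return links[frozenset((s, k))] / (sizes[s] * sizes[k])

        # Merge the (small, kept) pair with the highest relatedness.
        s, k = max(product(small, kept), key=relatedness)
        membership = {p: (k if cid == s else cid) for p, cid in membership.items()}

# Toy example (assumed data): cluster C cites cluster B, so C is merged into B.
membership = {"a1": "A", "a2": "A", "a3": "A", "a4": "A",
              "b1": "B", "b2": "B", "b3": "B", "c1": "C"}
citations = [("a1", "a2"), ("c1", "b1")]
merged = merge_to_c_clusters(membership, citations, 2)
```

Dividing by the product of the cluster sizes keeps large clusters from absorbing every small cluster merely because they have more papers and therefore more links.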

4.3 Visualizing the clusters

We visualize clusters using a packed bubble chart. We developed an algorithm to create these charts (see below). The input of our algorithm is an undirected network. In this network, nodes represent clusters of papers, the weight of a node indicates the number of papers in a cluster, and the weight of an edge between two nodes indicates the relatedness of two clusters in terms of citation links.

4.3.1 Bubble chart algorithm

Our bubble chart algorithm determines the coordinates of the bubbles, where each bubble is a node in a network. The objective of our bubble chart algorithm is to obtain a visualization in which the bubbles do not overlap, the empty space is minimized, and the positions of the nodes relative to each other reflect their relatedness as accurately as possible. We base our algorithm on the VOS layout algorithm [15] used in the VOSviewer software, but we make modifications in order to avoid overlapping bubbles and to minimize the empty space.

The area of a node is proportional to the weight of the node. Therefore, the radius of a node is set to the square root of w, where w is the weight of the node. Nodes connected by edges with a high weight should be close together. To achieve this, we minimize a weighted sum of the squared Euclidean distances between all pairs of nodes, which is similar to the VOS layout algorithm [15]. The weighting considers the weight of the edges between pairs of nodes. This weighted sum can be understood as the stress V of the network layout, and our objective is to minimize this stress. Mathematically, the stress function V is given by

$$V(x_1, \ldots, x_n) = \sum_{i<j} s_{ij} \, \lVert x_i - x_j \rVert^2, \tag{1}$$

where $x_i$ denotes the coordinates of node i in a two-dimensional space, $\lVert \cdot \rVert$ is the Euclidean norm, and $s_{ij}$ is the weight of the edge between nodes i and j. To avoid overlapping nodes, we add for all pairs of nodes i and j the constraint

$$\lVert x_i - x_j \rVert \ge r_i + r_j, \tag{2}$$

where $r_i$ is the radius of node i. Minimization of the stress function in Eq. 1 subject to the constraint in Eq. 2 is not straightforward, so we developed a minimization algorithm for it.
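To make the objective and the constraint concrete, the sketch below evaluates both for a given layout. It assumes that the stress of Eq. 1 is the edge-weighted sum of squared Euclidean distances over all node pairs, as described in the text, and that the constraint of Eq. 2 requires the distance between two nodes to be at least the sum of their radii. The function names, the dictionary-based data layout and the tolerance `tol` are assumptions made for this illustration.

```python
import math
from itertools import combinations

def radius(weight):
    """Radius of a bubble whose area is proportional to the node weight."""
    return math.sqrt(weight)

def stress(coords, s):
    """Assumed form of Eq. 1: sum over pairs of s_ij * squared Euclidean distance.

    coords: dict node -> (x, y); s: dict (i, j) -> edge weight with i < j.
    """
    total = 0.0
    for i, j in combinations(sorted(coords), 2):
        total += s.get((i, j), 0.0) * math.dist(coords[i], coords[j]) ** 2
    return total

def is_feasible(coords, radii, tol=1e-9):
    """Assumed form of Eq. 2: every pair of bubbles is at least r_i + r_j apart."""
    return all(
        math.dist(coords[i], coords[j]) >= radii[i] + radii[j] - tol
        for i, j in combinations(sorted(coords), 2)
    )

# Toy layout (assumed data): one cluster of 4 papers and two clusters of 1 paper.
radii = {0: radius(4), 1: radius(1), 2: radius(1)}
coords = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (0.0, 3.0)}
s = {(0, 1): 2.0, (0, 2): 1.0}
print(stress(coords, s), is_feasible(coords, radii))
```

Without the constraint, the stress would be minimized trivially by placing all nodes at the same point, which is why the non-overlap condition is essential to the formulation.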

4.3.2 Minimization algorithm

The best strategy to minimize Eq. 1 while satisfying Eq. 2 in a network of two nodes (nodes 1 and 2) is to place the nodes adjacent to each other. When we fix the coordinates of node 1, the coordinates where node 2 can be placed form a circle c(1,2) around node 1 (Fig. 3A). This circle has a radius equal to the sum of the radius of node 1 and the radius of node 2. Now, we also fix the coordinates of node 2 and add node 3 to the network layout. We can use the same strategy to get its coordinates. The adjacent coordinates for node 3 form the circles c(1,3) and c(2,3) (Fig. 3B). Therefore, the available coordinates to place node 3 are the intersection points of c(1,3) and c(2,3) (Fig. 3C).
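The candidate positions for a new node are the intersection points of two circles, which can be computed with standard plane geometry. The sketch below is an illustration under that assumption; the function names `circle_intersections` and `candidate_positions` are hypothetical and do not come from the tool.

```python
import math

def circle_intersections(c1, r1, c2, r2):
    """Return the intersection points of two circles (zero, one or two points)."""
    (x1, y1), (x2, y2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []  # concentric, too far apart, or one circle inside the other
    a = (r1 ** 2 - r2 ** 2 + d ** 2) / (2 * d)  # distance from c1 to the chord
    h = math.sqrt(max(r1 ** 2 - a ** 2, 0.0))   # half the length of the chord
    mx, my = x1 + a * (x2 - x1) / d, y1 + a * (y2 - y1) / d
    ox, oy = h * (y2 - y1) / d, h * (x2 - x1) / d
    points = [(mx + ox, my - oy), (mx - ox, my + oy)]
    return points[:1] if h == 0 else points

def candidate_positions(p1, rad1, p2, rad2, rad_new):
    """Positions at which a new bubble touches two placed bubbles at once.

    The circle c(i, new) has the centre of node i and radius r_i + r_new.
    """
    return circle_intersections(p1, rad1 + rad_new, p2, rad2 + rad_new)

# Nodes 1 and 2 are adjacent unit bubbles; node 3 is another unit bubble.
points = candidate_positions((0.0, 0.0), 1.0, (2.0, 0.0), 1.0, 1.0)
```

For two adjacent bubbles there are always two such points, one on each side of the line through their centres.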

When we add node 4 to the network layout, the available coordinates for this node are no longer all the intersection points of the circles c(i,j), because some coordinates would cause nodes to overlap (Fig. 3D). Of the available coordinates, we select the ones that result in the lowest stress. We can find these coordinates by calculating the weighted sum of the squared Euclidean distances between node 4 and each node that has already been assigned coordinates. We proceed in the same way for all other nodes.

Our minimization algorithm obtains the coordinates of the nodes by adding them one by one to the network layout. However, we found that the value of the stress at the end of an algorithm run is highly dependent on the order in which the nodes were added. To improve our minimization algorithm, we added a step in which we create several lists, each containing the nodes in a different order. For each list, we run the minimization procedure, and in the end we return the network layout with the lowest stress.

We order the nodes in the lists as follows. For each node in the network, we create a list with that node as the first node. The next node in the list is the one that is most strongly related to the nodes already in the list. We repeat this process until all nodes have been added to the list.
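The ordering rule can be sketched as a greedy procedure. In this illustration, "most strongly related" is taken to mean the largest total edge weight to the nodes already in the list, in line with the S_ORDER subroutine in the appendix. The function names and the edge-weight dictionary `s` are assumptions, and ties are broken by the input order of the nodes.

```python
def greedy_order(start, nodes, s):
    """Order nodes from `start`, always adding the node with the largest
    total edge weight to the nodes already in the list."""
    def weight(i, j):
        return s.get((i, j), s.get((j, i), 0.0))

    ordered = [start]
    remaining = [n for n in nodes if n != start]
    while remaining:
        best = max(remaining, key=lambda n: sum(weight(n, m) for m in ordered))
        ordered.append(best)
        remaining.remove(best)
    return ordered

def all_orders(nodes, s):
    """One ordering per node, each starting from a different node."""
    return [greedy_order(n, nodes, s) for n in nodes]

# Toy network (assumed data): edge weights between four clusters.
nodes = ["a", "b", "c", "d"]
s = {("a", "b"): 5.0, ("b", "c"): 3.0, ("a", "d"): 1.0, ("c", "d"): 4.0}
orders = all_orders(nodes, s)
```

Because one list is created per node, the layout procedure is run as many times as there are nodes.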

Our minimization algorithm is a heuristic approach to the minimization of Eq. 1 and does not guarantee that the global minimum of Eq. 1 will be found. The pseudocode of the algorithm is provided in the appendix.

5 Conclusion

We have proposed a tool for scientific literature search based on the scatter/gather approach. The tool visualizes the structure of the search results using a packed bubble chart. We have presented a case study demonstrating the use of the tool and we have provided a technical specification of the algorithms included in the tool, in particular the algorithm for creating packed bubble charts.

Compared to traditional literature search tools that present the search results as a list of documents (e.g. Google Scholar), we expect the advantage of our tool to be in the emphasis it puts on showing the structure of the search results. We expect this to be important especially when users are searching not for one specific paper but for a larger set of papers offering a broad understanding of a certain scientific domain. In future work, we plan to test the performance of the tool for different information retrieval tasks.

Appendix

----
INPUT: list INLIST containing nodes (x0,...,xn).
    Each node possesses:
        A node identity id(x)
        A radius r(x)
        A list of edges E(x) containing (e0,...,en), with each edge e possessing a weight w(e) and a node identity id(e) of the node it connects to
        A coordinate c(x) that contains nothing
OUTPUT: list OUTLIST containing nodes (x0,...,xn) possessing non-empty coordinates c(x)
----
Create list MASTERLIST containing nothing
For each node xi in list INLIST (x0,...,xn):
    Complete subroutine S_ORDER(xi,(x0,...,xn))
    Create list Zi containing nothing
    Set coordinate c(xi0) of node xi0 as (0,0)
    Append node xi0 to list Zi
    Set coordinate c(xi1) of node xi1 as (r(xi0)+r(xi1),0)
    Append node xi1 to list Zi
    Complete subroutine S_COOR(Zi,(xi2,...,xin))
    Append list Zi to list MASTERLIST
Return list OUTLIST in MASTERLIST (Z0,...,Zn), where OUTLIST is the list with the lowest graph stress V(OUTLIST) as defined in Eq. 1
----
Subroutine S_ORDER creates an order of nodes
S_ORDER(xi,(x0,...,xn)):
    Create list Xi containing nothing
    Append node xi to list Xi as node xi0
    Create list Yi containing nodes (x0,...,xn)
    Remove node xi from list Yi
    While list Yi contains something:
        For each node xj in Yi:
            Declare twj is the total weight from xj to all the nodes in Xi
        Declare xtw is the node with the greatest twj
        Append node xtw to list Xi as node xij
        Remove node xtw from list Yi
----
Subroutine S_COOR gets the coordinates of the nodes for nodes x>1
S_COOR(Zi,(xi2,...,xin)):
    For each node xij in (xi2,...,xin):
        Create empty list TEMPij
        For each order-independent pair of nodes (xijm, xijn) in list Zi, where m > n:
            Complete subroutine S_TEST(xij,xijm,xijn,Zi,TEMPij)
        Append node tempij to list Zi, where tempij is the temporary node with the lowest node stress v in list TEMPij
----
Subroutine S_TEST tests whether the node xij can be adjacent to the nodes (xijm, xijn), gets the coordinates of the centers of these adjacent positions, tests whether the node xij at those coordinates overlaps with other nodes, and gets the stress of the node xij at those coordinates.
S_TEST(xij,xijm,xijn,Zi,TEMPij):
    Declare temporary node tempijm with coordinate c(xijm) and radius (r(xij)+r(xijm))
    Declare temporary node tempijn with coordinate c(xijn) and radius (r(xij)+r(xijn))
    If tempijm and tempijn DO overlap:
        Declare coordinates coorijmn1 and coorijmn2 are the coordinates of the intersection between the borders of tempijm and tempijn
        For coorijmnk in list (coorijmn1, coorijmn2):
            Declare temporary node tempijmnk is a node with the parameters of node xij, except that its coordinate c(tempijmnk) is coorijmnk
            If node tempijmnk DOES NOT overlap with any node in Zi:
                Declare node stress vijmnk is the total stress of the node tempijmnk with every node in the list Zi
                Append tempijmnk to list TEMPij
----
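The pseudocode above can be assembled into a compact runnable sketch. This is an illustrative translation under stated assumptions, not the implementation used in the tool: nodes are represented by dictionary keys rather than by node objects, edge weights are stored in a dictionary `s`, a small tolerance `tol` absorbs floating-point error in the overlap test, and the stress is assumed to be the edge-weighted sum of squared distances. All names are hypothetical.

```python
import math
from itertools import combinations

def layout(radii, s, tol=1e-9):
    """Pack bubbles following the appendix pseudocode.

    radii: dict node -> radius; s: dict (i, j) -> edge weight (either key order).
    Returns dict node -> (x, y) for the node ordering with the lowest stress.
    """
    def w(i, j):
        return s.get((i, j), s.get((j, i), 0.0))

    def order(start):  # S_ORDER
        xs, ys = [start], [n for n in radii if n != start]
        while ys:
            best = max(ys, key=lambda n: sum(w(n, m) for m in xs))
            xs.append(best)
            ys.remove(best)
        return xs

    def intersections(c1, r1, c2, r2):
        d = math.dist(c1, c2)
        if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
            return []
        a = (r1 * r1 - r2 * r2 + d * d) / (2 * d)
        h = math.sqrt(max(r1 * r1 - a * a, 0.0))
        mx, my = c1[0] + a * (c2[0] - c1[0]) / d, c1[1] + a * (c2[1] - c1[1]) / d
        ox, oy = h * (c2[1] - c1[1]) / d, h * (c2[0] - c1[0]) / d
        return [(mx + ox, my - oy), (mx - ox, my + oy)]

    def place(node, placed):  # S_COOR and S_TEST for a single node
        candidates = []
        for m, n in combinations(placed, 2):
            for pos in intersections(placed[m], radii[m] + radii[node],
                                     placed[n], radii[n] + radii[node]):
                # Keep only positions that do not overlap any placed bubble.
                if all(math.dist(pos, placed[k]) >= radii[k] + radii[node] - tol
                       for k in placed):
                    node_stress = sum(w(node, k) * math.dist(pos, placed[k]) ** 2
                                      for k in placed)
                    candidates.append((node_stress, pos))
        if not candidates:
            raise ValueError(f"no feasible position for node {node!r}")
        return min(candidates)[1]

    best_stress, best_layout = math.inf, None
    for start in radii:  # one run per ordering, keep the lowest stress
        nodes = order(start)
        placed = {nodes[0]: (0.0, 0.0)}
        if len(nodes) > 1:
            placed[nodes[1]] = (radii[nodes[0]] + radii[nodes[1]], 0.0)
        for node in nodes[2:]:
            placed[node] = place(node, placed)
        total = sum(w(i, j) * math.dist(placed[i], placed[j]) ** 2
                    for i, j in combinations(placed, 2))
        if total < best_stress:
            best_stress, best_layout = total, placed
    return best_layout

# Toy input (assumed data): four clusters with radii and relatedness weights.
radii = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 1.5}
s = {("a", "b"): 3.0, ("b", "c"): 1.0, ("a", "d"): 2.0, ("c", "d"): 0.5}
positions = layout(radii, s)
```

Each run tests every pair of placed nodes for every new node, so this sketch is suited to the small numbers of clusters shown in a bubble chart rather than to large networks.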

References

1. Google Scholar. https://scholar.google.com/, last accessed 2018/01/27.

2. Web of Science. https://clarivate.com/products/web-of-science/, last accessed 2018/01/27.

3. Scopus. https://www.scopus.com/, last accessed 2018/01/27.

4. Dimensions. https://www.dimensions.ai/, last accessed 2018/01/27.

5. Abbasi, M.K., Frommholz, I. Cluster-based polyrepresentation as science modelling approach for information retrieval. Scientometrics, 102(3), 2301-2322 (2015). doi: 10.1007/s11192-014-1478-1

6. He, J., Ping, Q., Lou, W., Chen, C. PaperPoles: Facilitating adaptive visual exploration of scientific publications by citation links. Journal of the Association for Information Science and Technology (2019). doi: 10.1002/asi.24171

7. Chen, C., Ibekwe-SanJuan, F., Hou, J. The structure and dynamics of cocitation clusters: A multiple-perspective cocitation analysis. Journal of the American Society for Information Science and Technology, 61(7), 1386-1409 (2010). doi: 10.1002/asi.21309

8. Van Eck, N.J., Waltman, L. CitNetExplorer: A new software tool for analyzing and visualizing citation networks. Journal of Informetrics, 8(4), 802-823 (2014). doi: 10.1016/j.joi.2014.07.006

9. Citation Gecko. http://citationgecko.com/, last accessed 2018/01/27.

10. Open Knowledge Maps. https://openknowledgemaps.org/, last accessed 2018/01/27.

11. Van Eck, N.J., Waltman, L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523-538 (2009). doi: 10.1007/s11192-009-0146-3

12. Cutting, D.R., Karger, D.R., Pedersen, J.O., Tukey, J.W. Scatter/gather: A cluster-based approach to browsing large document collections. Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, 318-329 (1992). doi: 10.1145/133160.133214

13. Waltman, L., Van Eck, N.J. A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378-2392 (2012). doi: 10.1002/asi.22748

14. Traag, V., Waltman, L., Van Eck, N.J. From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9, 5233 (2019). doi: 10.1038/s41598-019-41695-z

15. Van Eck, N.J., Waltman, L., Dekker, R., Van den Berg, J. A comparison of two techniques for bibliometric mapping: Multidimensional scaling and VOS. Journal of the American Society for Information Science and Technology, 61(12), 2405-2416 (2010). doi: 10.1002/asi.21421