<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ACE: Big Data Approach to Scientific Collaboration Patterns Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrei Zammit</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenneth Penza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
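        <contrib contrib-type="author">
          <string-name>Foaad Haddod</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charlie Abela</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>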
        <contrib contrib-type="author">
          <string-name>Joel Azzopardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Artificial Intelligence, University of Malta</institution>
          ,
          <addr-line>Msida</addr-line>
          ,
          <country country="MT">Malta</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The characteristics of scientific collaboration networks have been extensively analysed and found to be similar to those of other scale-free networks. Research has furthermore focused on investigating how collaboration patterns between authors evolve over time, providing insights into different fields of research. Numerous bibliographic datasets, such as DBLP and Microsoft Academic Graph, provide the basis for investigating and analysing such networks. This paper presents ACE (Academic Collaboration analyzEr), an interactive framework that uses big data technologies and allows scientific collaboration patterns to be analysed and visualised. Through ACE it is possible to reveal the key authors in particular fields of research, the topological features of the collaboration network, the network trends over time and the relationships between authors and co-authors. Furthermore, ACE allows for the discovery of potential new collaborations between authors in the same field of research, as well as fields in which scientists can conduct future joint research work.</p>
      </abstract>
      <kwd-group>
        <kwd>graph analysis</kwd>
        <kwd>big data</kwd>
        <kwd>collaboration patterns</kwd>
        <kwd>collaboration networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Bibliometrics and scientometrics are two closely related research fields intended
to measure and analyse scientific publications and science. Collaboration
analysis is the study in which author collaborations in scholarly articles are used to
establish relationships between authors and/or fields of study. This analysis is
intended to provide insight into the evolving communities of authors and
scholarly publications, the collaborations between authors, and the evolution of areas
of knowledge over time. The impact factor of a journal is partly determined by the number
of citations to the articles it publishes. If an article is published in a
journal with a high impact factor, the publishing profile of the author is raised.
The number of citations to that article over time is also a measure of the impact
of that author.</p>
      <p>Collaboration networks are typically visualized as graphs in which a vertex
represents some entity and an edge represents some property relating multiple
vertices. In a collaboration graph, vertices can typically represent authors, papers
and keywords. Different types of edges can be used to represent different
interactions between these entities; for instance, in the case of an author and
a paper, an edge can represent the authoredBy relation, whilst in the case of a
keyword and a paper, an edge can represent the usedBy relation. Depending on
the schema adopted, both vertices and edges can have an arbitrary number of
properties. Furthermore, separating entities into different vertices can aid the
visualisation and analysis tasks.</p>
      <p>
        In a network, particular nodes might be more important than others due to
the preferential attachment characteristic highlighted by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The notion of
importance can be computed mathematically using graph metrics such as closeness,
betweenness and degree centrality. Closeness centrality is defined in terms of the geodesic
distance, which is the length of the shortest path between two nodes [
        <xref ref-type="bibr" rid="ref13 ref6">6, 13</xref>
        ]. The closeness
centrality of a vertex is computed by dividing the number of reachable nodes by the sum of the
geodesic distances to each reachable vertex. Betweenness is a centrality measure
computed on shortest paths [
        <xref ref-type="bibr" rid="ref13 ref15 ref29 ref6">6, 13, 29, 15</xref>
        ]. A vertex has a higher betweenness
if more shortest paths pass through it. On the other hand,
vertices with a higher degree centrality have a higher probability of being part of a
dense network [
        <xref ref-type="bibr" rid="ref13 ref15 ref29 ref6">6, 13, 29, 15</xref>
        ]. In other graph analysis algorithms, such as
PageRank [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the importance of a node in the network depends on the number of
times a random surfer visits that node. In the case of websites, if a site
has a high in-degree, the probability of revisiting the same site is higher.
      </p>
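      <p>To make the closeness and degree definitions above concrete, the following minimal Python sketch computes both on a small undirected co-authorship graph. The graph, the author labels and the function names are hypothetical and serve only as an illustration. Closeness follows the definition given in the text (number of reachable nodes divided by the sum of geodesic distances); degree is normalised by the number of other vertices, which is one common convention and an assumption of this sketch. Betweenness is omitted for brevity.</p>
      <preformat><![CDATA[
```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(graph):
    """Number of neighbours divided by the number of other vertices."""
    others = len(graph) - 1
    return {v: len(neighbours) / others for v, neighbours in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of geodesic distances to them."""
    result = {}
    for v in graph:
        dist = shortest_path_lengths(graph, v)
        total = sum(dist.values())   # the source itself contributes 0
        reachable = len(dist) - 1    # exclude the source
        result[v] = reachable / total if total else 0.0
    return result


# Hypothetical toy graph: an edge means two authors share a paper.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

if __name__ == "__main__":
    print(degree_centrality(GRAPH))
    print(closeness_centrality(GRAPH))
```
]]></preformat>
      <p>In this toy graph, vertex C is adjacent to every other vertex, so it obtains the highest value under both measures.</p>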
      <p>
        An interesting aspect of collaboration analysis is the identification of
potential collaborators for a given author. Turker et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] report on various
studies that have been performed to analyse co-authorship networks from the
perspective of the research disciplines involved and the journals to which the
research was submitted. In this work, mathematical techniques were used to identify
strong collaborations and the authors that were more likely to collaborate with
others. Another approach to co-author prediction, reported by [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], uses random
walks and graph metrics to perform author suggestion. The model uses a set of
criteria to select potential candidates, including authors that collaborated with
different authors, authors that already collaborated and authors with common
co-authors.
      </p>
      <p>The availability of bibliographic datasets such as DBLP<sup>1</sup> and Microsoft
Academic Graph<sup>2</sup>, together with large-scale graph processing technologies such as
Apache Spark<sup>3</sup> and Neo4j<sup>4</sup>, offers new research opportunities in the bibliometrics
and scientometrics fields.</p>
      <p>
        In recent years, a number of studies have analysed such collaboration
networks in search of emerging trends and communities of interest by leveraging
big data technologies. Sreenivas et al. proposed a data discovery and knowledge
recording tool called SEEKER [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] that uses big data technologies to help users
quickly assimilate knowledge from diverse data sources with different formats,
hosted across different infrastructures. SEEKER provides collaborative
knowledge management tools and access to a data warehouse via a query interface,
and presents results through a variety of visualisations. Another big data approach,
proposed by [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], analysed the social network of eleven years of publications in
engineering education and their authors. The bibliometric analysis was based on
grouping authors by research area, disciplinary background and
geographical location.
        <fn id="fn1"><label>1</label><p>http://dblp.uni-trier.de/xml/</p></fn>
        <fn id="fn2"><label>2</label><p>http://aka.ms/academicgraph</p></fn>
        <fn id="fn3"><label>3</label><p>http://spark.apache.org/</p></fn>
        <fn id="fn4"><label>4</label><p>https://neo4j.com/</p></fn>
      </p>
      <p>In this paper, we present the Academic Collaboration analyzEr (ACE)<sup>5</sup>
interactive framework, which enriches collaboration networks resulting from the
Microsoft Academic Graph and the DBLP datasets with keywords extracted from
the publications. ACE uses the big data technologies Apache Spark and Neo4j to
allow the user to identify research trends and communities in the collaboration
networks. Furthermore, ACE permits the analysis of the networks from different
perspectives, which include author, keyword and publication year. Through ACE
the user can identify potential collaborators for a given author and the evolving
community of researchers around specific keywords.</p>
      <p>The rest of the paper is structured as follows: in the next section we present
literature related to different collaboration network analysis tools. Then, in the
methodology section, we go into detail through the various steps used to build ACE.
We discuss the challenges that were encountered and explain how we addressed
them. In the experiments section, we report the findings from using ACE
to answer a specific set of queries related to particular authors and keywords.
In the final section we present some conclusions and ideas on how ACE can be
extended.</p>
    </sec>
    <sec id="sec-2">
      <title>Literature Review</title>
      <p>
        Nowadays, measuring the scientific output of researchers is becoming
increasingly important to support research assessment decisions related to accepting
research projects, contracting researchers and/or awarding scientific prizes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Despite the recent advances in scientific impact prediction and, more specifically,
paper citation prediction, it is still unclear and even controversial whether
one should depend on the reliability and bound of the prediction accuracy of a
long-term citation prediction model. A number of measures, such as the g-index
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and the h-index [
        <xref ref-type="bibr" rid="ref1 ref12">12, 1</xref>
        ], have become popular for gauging journals, scholars,
labs, departments, and institutes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Other tools such as Microsoft Academic
Search<sup>6</sup>, Rexplore [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], ArnetMiner [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and Saffron<sup>7</sup> provide a variety of
visualizations that can be used for trend analysis, such as publication trends and
co-authorship paths among researchers. We can also find several systems for
exploring and making sense of research data, such as Google Scholar<sup>8</sup>,
FacetedDBLP<sup>9</sup> and CiteSeerX [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
        <fn id="fn5"><label>5</label><p>https://youtu.be/kzXOIzddEa4</p></fn>
        <fn id="fn6"><label>6</label><p>https://academic.microsoft.com/</p></fn>
        <fn id="fn7"><label>7</label><p>http://saffron.insight-centre.org/</p></fn>
      </p>
      <p>
        <bold>h-index and g-index.</bold>
As citation data have become more available, new metrics for analysis have
been developed. The best-known metrics include the h-index [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and g-index
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which are aimed at facilitating comparisons of the impact or importance
of individual researchers. The h-index is considered to be a way to assess the
impact of an individual author without the skewed citation distribution affecting
the results. This index reflects both the overall number of publications and the level
of citation of those publications. While evaluation at the level of individuals
is useful, evaluation at the journal level is more practical for large-scale
assessment of research outputs, such as those carried out by universities and
funding agencies [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The easiest method to calculate the h-index is to first
rank papers in a table in descending order by the number of citations they have
received; the h-index is then the largest rank h for which the paper at that rank
has at least h citations. The h-index can be applied to journals as well as researchers [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
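      <p>The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from ACE; the function name and the citation counts in the example are hypothetical.</p>
      <preformat><![CDATA[
```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    # Rank papers in descending order by the number of citations.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the paper at this rank still has enough citations
        else:
            break         # every later paper has even fewer citations
    return h


if __name__ == "__main__":
    # Hypothetical citation counts for one author's five papers.
    print(h_index([10, 8, 5, 4, 3]))
```
]]></preformat>
      <p>For the hypothetical counts 10, 8, 5, 4 and 3, the fourth-ranked paper has four citations while the fifth has only three, so the h-index is 4.</p>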
      <p>
        The g-index was introduced as an improvement of the h-index to measure
the global citation performance of a set of articles. It inherits all the good
properties of the h-index and, in addition, takes into account the citation scores
of the top articles. This yields a better distinction between, and ordering of, scientists from
the point of view of visibility [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A measure that indicates the overall
quality of a scientist or of a journal should deal with the performance of the
top articles, and hence their number of citations should be counted. This can be
accomplished by modifying the h-index so that its insensitivity to the citation counts
of the top articles is addressed while keeping all the advantages of the h-index and
ensuring that the new index remains as simple to calculate as the h-index [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
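      <p>As an illustration of how the g-index takes the citation scores of the top articles into account, the sketch below uses the standard definition of the index: the largest g such that the g most cited articles together received at least g squared citations. The definition is not spelled out in the text above and is stated here as an assumption of the sketch; the function name and the example counts are hypothetical.</p>
      <preformat><![CDATA[
```python
def g_index(citations):
    """Largest g such that the top g papers have at least g*g citations in total."""
    ranked = sorted(citations, reverse=True)
    g = 0
    running_total = 0
    for rank, cites in enumerate(ranked, start=1):
        running_total += cites            # citations of the top `rank` papers
        if running_total >= rank * rank:
            g = rank
    return g


if __name__ == "__main__":
    # Hypothetical author with one highly cited paper.
    print(g_index([25, 8, 5, 3, 3]))
```
]]></preformat>
      <p>In the hypothetical example, the single paper with 25 citations keeps the cumulative total above the squared rank for all five papers, so the index is bounded only by the number of papers. An index that counts only how many papers pass a per-paper threshold would not reward that paper in the same way.</p>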
      <sec id="sec-2-1">
        <title>Microsoft Academic Search</title>
        <p>
          Microsoft Academic Search (MAS) provides a variety of visualizations,
including co-authorship graphs, publication trends, and co-authorship paths between
authors [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The coverage of MAS was initially limited to the computer
science and technology fields, but this was extended in March 2011 to other
categories, thus turning MAS into a platform oriented to the identification of the
top papers, authors, conferences and organisations in 15 fields of research and
more than 200 sub-fields. It provides both the bibliographic description of the
publications and their citation counts. In short, it offers everything required to
identify the most relevant research and to carry out comparative performance
assessments [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. MAS is a scientific web database which gathers bibliographic
information from the main scientific publishers (such as Elsevier<sup>10</sup> and Springer<sup>11</sup>)
and bibliographic services (such as CrossRef<sup>12</sup>). It contains roughly 38.9 million
documents and 22 million profiles. Amongst other features, MAS presents a
personal profile which provides not only the author's list of publications but also
relevant bibliometric indicators (publications, citations), the disciplinary areas
of interest and other rosters showing the most frequent co-authors, preferred
journals and a few important keywords [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
          <fn id="fn8"><label>8</label><p>https://scholar.google.com</p></fn>
          <fn id="fn9"><label>9</label><p>http://dblp.l3s.de</p></fn>
          <fn id="fn10"><label>10</label><p>https://www.elsevier.com/</p></fn>
          <fn id="fn11"><label>11</label><p>www.springer.com/</p></fn>
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Google Scholar</title>
        <p>
          Google Scholar (GS) constituted a great revolution in the retrieval of scientific
literature, since for the first time bibliographic search was not limited to the
library or to traditional bibliographic databases. Instead, because it was
conceived as a simple and easy-to-use web service, GS enabled simple bibliographic
search for everyone with access to the web. GS is freely accessible and it indexes
data from publishers only if the publisher is willing to provide at least the
abstract of the paper freely. The data comes from other sources as well, such as freely
available full text from preprint servers or personal websites; thus in
many cases the full text is freely available for all users [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. GS uses web crawlers
to retrieve scholarly material from journal websites, university repositories, and
authors' personal websites. Scholarly documents are identified by means of
automatic format inspection (such as the title in large font on the front page, authors'
names right below the title, and the presence of a section titled "References"
or "Bibliography"). Indexing is done automatically by parsers that identify
bibliographic data in the selected documents. It has been argued that, because of
its automatic inclusion process, GS is susceptible to errors in metadata and to
indexing of non-scientific works [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>ArnetMiner</title>
        <p>
          ArnetMiner offers different visualizations and provides support for expert search
and trend analysis [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. This system consists of five main components:
extraction, integration, storage and access, search, and mining.
        </p>
        <p>
          <list list-type="roman-lower">
            <list-item>
              <p>Extraction: focuses on extracting researcher profiles from the Web
automatically by identifying relevant pages from the web and collecting publications
from existing digital libraries [
              <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
              ].</p>
            </list-item>
            <list-item>
              <p>Integration: integrates the extracted researcher profiles and the extracted
publications by using the researcher name as the identifier. A probabilistic
framework has been proposed to deal with the name ambiguity problem in
the integration. The integrated data is stored in a Researcher Network
Knowledge Base (RNKB) [
              <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
              ].</p>
            </list-item>
            <list-item>
              <p>Storage and Access: provides storage and indexing for the extracted and
integrated data in the RNKB [
              <xref ref-type="bibr" rid="ref2 ref25 ref26 ref5">25, 26, 2, 5</xref>
              ].</p>
            </list-item>
            <list-item>
              <p>Search: provides three types of search activities: person search, publication
search, and conference search. It also provides other services, e.g., author
interest finding and academic suggestion [
              <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
              ].</p>
            </list-item>
            <list-item>
              <p>Mining: provides five mining services: expert finding, people association
finding, hot-topic finding, sub-topic finding, and survey paper finding [
              <xref ref-type="bibr" rid="ref25">25</xref>
              ].</p>
            </list-item>
          </list>
          <fn id="fn12"><label>12</label><p>https://www.crossref.org/</p></fn>
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Rexplore</title>
        <p>
          Rexplore<sup>13</sup> is a tool that integrates statistical analysis, semantic technologies,
and visual analytics to provide effective support for exploring and making sense
of scholarly data [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The semantic relationships among authors and topics are
at the heart of many new functionalities of Rexplore. These relationships are
used in particular for computing novel kinds of similarities and ranking metrics
that take into consideration the semantic characterization of research areas.
Furthermore, the semantic relationships improve the ability of Rexplore to interpret
user queries and enable a novel graph-based navigation technique, which
combines both the semantic relationships and automatically computed metrics to
generate links between the elements of the domain [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          Rexplore supports users effectively by enabling them to detect and make
sense of the important trends in one or more research areas. Additionally, users
are able to identify researchers and analyse their academic trajectory and
performance in one or multiple areas, according to a variety of fine-grained
requirements. Furthermore, Rexplore users can discover and explore a variety of
dynamic relations between researchers and topics and rank specific sets of
authors, generated through multi-dimensional filters, according to various metrics
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Other important features of Rexplore include:
          <list list-type="order">
            <list-item>
              <p>Data Integration: Rexplore integrates a variety of data sources in different
formats, including the MAS API, DBLP++ and DBpedia.</p>
            </list-item>
            <list-item>
              <p>Topic Ontology and Klink: while most systems use keywords as proxies for
research topics, Rexplore relies on an OWL ontology, which characterizes
research areas and their relationships.</p>
            </list-item>
            <list-item>
              <p>Multi-criteria Search: Rexplore offers fine-grained search functionality for
authors, publications and organizations with respect to detailed multi-dimensional
parameters.</p>
            </list-item>
            <list-item>
              <p>The Graph View: the graph view is an interactive tool to explore the space
of research entities and their relationships using faceted filters. It takes as
input authors, organizations, countries or research communities and
generates their relationship graph, allowing the user to choose among a variety of
connections, ranking criteria, views and filters.</p>
            </list-item>
            <list-item>
              <p>Community Detection: Rexplore integrates a novel algorithm called TST
(Temporal Semantic Topic-Based Clustering), which identifies communities
of researchers who appear to follow a similar research trajectory.</p>
            </list-item>
            <list-item>
              <p>Author and Group Analysis: every author in Rexplore has a personal page
which offers a variety of metrics and visualizations to analyse the author's
performance, trends and collaborations.</p>
            </list-item>
          </list>
          <fn id="fn13"><label>13</label><p>https://technologies.kmi.open.ac.uk/rexplore/</p></fn>
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section we describe the challenges that we had to address within ACE,
from pre-processing to the integration of different big data technologies.</p>
      <sec id="sec-3-1">
        <title>Dataset selection and pre-processing</title>
        <p>Two important reference datasets for bibliographic information about major
computer science publications are DBLP and the Microsoft Academic Graph.
The Microsoft Academic Graph dataset is much larger than DBLP and the
structure of the two datasets is completely different. The first challenge was the
choice of the dataset to use for ACE and the related experiments. The Microsoft
Academic Graph was considered to be too extensive to be processed on a typical
personal computer and hence DBLP was chosen. The DBLP
dataset is based on XML and contains two types of records: articles and
inproceedings. Table 1 shows the record structure.</p>
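        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>DBLP structure</p>
          </caption>
          <table>
            <thead>
              <tr><th>Article</th><th>In Proceedings</th></tr>
            </thead>
            <tbody>
              <tr><td>Title</td><td>Title</td></tr>
              <tr><td>Year</td><td>Page</td></tr>
              <tr><td>Volume</td><td>Volume</td></tr>
              <tr><td>EE</td><td>EE</td></tr>
              <tr><td>URL</td><td>URL</td></tr>
              <tr><td>Journal</td><td>Year</td></tr>
              <tr><td /><td>Book Title</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Microsoft Academic Graph enrichment</p>
          </caption>
          <table>
            <thead>
              <tr><th>Paper</th><th>Keywords</th><th>DOI DBLP</th></tr>
            </thead>
            <tbody>
              <tr><td>Paper ID</td><td>Paper ID</td><td>DOI</td></tr>
              <tr><td>Title</td><td>Keyword</td><td /></tr>
              <tr><td>Venue</td><td>Field of study</td><td /></tr>
              <tr><td>Author ID</td><td /><td /></tr>
              <tr><td>Affiliation ID</td><td /><td /></tr>
              <tr><td>DOI</td><td /><td /></tr>
              <tr><td>Journal ID</td><td /><td /></tr>
              <tr><td>Conference ID</td><td /><td /></tr>
            </tbody>
          </table>
        </table-wrap>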
        <p>The structure clearly showed that the dataset lacked information
required to execute the experiments with ACE, namely the keywords, field
of study and abstract. To source this information, a web crawler and parser were
developed to consume the Digital Object Identifier (DOI) provided in the EE
XML tag. The DOI is a standard used to cite and link permanently to electronic
documents. The DOI would typically direct to a specific page of a publishing
house which contains the title, author, abstract and keywords of a particular
journal article or research paper. A Python script was written to extract the EE XML
tag and dump it to a text file. The crawler and parser were instantiated to target
different publishing houses, and any locations which could not be parsed were
stored locally for retry at a later stage. This method of sourcing the missing
data failed after nearly one thousand records (papers/journals) had been processed,
because the website of the publishing house tracked this activity and blocked
the IP address of the machine which launched the crawler and parser. An alternative
option that was explored was to use the journal publishers' developer APIs.
These APIs allow a client to query a web service using
the DOI and retrieve metadata such as the abstract and keywords. However,
usage of these APIs was limited to a small number of calls to the service per day.
Given the sheer number of DOIs to be fetched and the search limits
imposed, it was not feasible to get a reasonable number of abstracts in a short
timeframe. Finally, the unavailability of the abstract and keyword data was
mitigated through the use of the Microsoft Academic Graph dataset. This dataset
is more comprehensive than DBLP and it contains all the required information.
The Microsoft Academic Graph data consists of tab-delimited text files and is
structured as illustrated in Table 2. Using DataFrames found in Apache Spark, three
schemas were created: one for the DOI file, and the other two for the Paper and
Keyword files from the Microsoft Academic Graph dataset. The two DataFrames
originating from the Microsoft dataset were then joined together via the Paper
ID field and in turn this joined DataFrame was linked to the DOI DataFrame via
the DOI key. The process used to enrich the DBLP dataset using the MAG keywords is
illustrated in Figure 1.</p>
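        <p>The two-step join described above can be illustrated without a Spark installation. The following minimal sketch uses plain Python dictionaries as a stand-in for the Spark DataFrames; it is not the code used in ACE. The function name, the row layouts and the sample DOIs are hypothetical, and lower-casing the DOI before matching is an assumption of the sketch.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def enrich_dblp(dblp_dois, mag_papers, mag_keywords):
    """Attach MAG keywords to DBLP records via Paper ID, then via DOI.

    dblp_dois:    iterable of DOI strings taken from the DBLP EE tags
    mag_papers:   iterable of (paper_id, title, doi) rows
    mag_keywords: iterable of (paper_id, keyword) rows
    """
    # Step 1: join the Paper and Keyword rows on the Paper ID field.
    keywords_by_paper = defaultdict(list)
    for paper_id, keyword in mag_keywords:
        keywords_by_paper[paper_id].append(keyword)

    # Step 2: index the joined rows by DOI (lower-cased; an assumption here).
    by_doi = {}
    for paper_id, title, doi in mag_papers:
        if doi:
            by_doi[doi.lower()] = (title, keywords_by_paper.get(paper_id, []))

    # Step 3: inner join with the DOI list extracted from DBLP.
    enriched = {}
    for doi in dblp_dois:
        match = by_doi.get(doi.lower())
        if match is not None:
            enriched[doi] = {"title": match[0], "keywords": match[1]}
    return enriched


if __name__ == "__main__":
    # Hypothetical sample rows.
    papers = [("P1", "Graph Mining", "10.1000/ABC"), ("P2", "Other", "10.1000/xyz")]
    keywords = [("P1", "graph analysis"), ("P1", "big data"), ("P2", "databases")]
    print(enrich_dblp(["10.1000/abc", "10.1000/missing"], papers, keywords))
```
]]></preformat>
        <p>DBLP records whose DOI has no counterpart in the Microsoft Academic Graph rows are dropped by the final step, which mirrors an inner join.</p>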
      </sec>
      <sec id="sec-3-2">
        <title>Graph Database</title>
        <p>We evaluated three different graph database setups: Neo4j, Apache Spark with
GraphX and Apache Spark with GraphFrames. Apache Spark offers high
scalability and parallel graph processing. Data manipulation is performed via Scala.
Neo4j is a robust graph database and uses an SQL-like language called Cypher
to manipulate data. One of the aims of ACE is to be a portable application
which can be executed on typical everyday personal computers. Hence, one of
the requirements was that the graph database should not require special hardware
to operate and should offer interface APIs. Neo4j is a mature product, backed by
detailed documentation and official client APIs for different languages. This graph
database has a large community and is widely used in industry. For ACE, Neo4j
was deemed to be the ideal backend candidate. ACE was to be implemented
using the .NET Framework and C# as a language. The main factor for this
decision was that there is currently no official .NET API for Spark, whereas
Neo4j has an official client API. Currently, the only possible binding
using C# with Apache Spark is via Mobius, which is still in an early beta stage.
Furthermore, Spark required many more hardware resources than Neo4j, which
made Spark impossible to execute on personal computers. Considering these
restrictions, Neo4j and Apache Spark with GraphFrames were chosen to perform
experiments using parallel operations and resilient distributed datasets (RDD).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Schema</title>
        <p>The enriched dataset consists of information about papers, their authors,
author-selected keywords and year of publication. The schema was defined to map the
information in the dataset into a number of vertices and edges. The details of each
entity were stored in the vertex as properties. A number of edge types were used
to link vertices; for example, an author authors a 'paper' whilst a 'paper' has a
'keyword'. When querying the graph, the edge type can be specified to identify
the relation type being requested. For example, the number of outgoing edges of
type authors amounts to the number of papers authored. Similarly, the number
of incoming edges of a keyword vertex amounts to the number of papers using
that keyword. Figure 2 illustrates the graph database schema used in ACE.
          <preformat>// Figure 3: potential collaborators for a given author
MATCH (a:Author)-[r:Authors]-&gt;(p:Paper)-[s:Uses_keyword]-&gt;(k:Keyword)
WHERE a.name = "{0}"
WITH DISTINCT a AS a, p.journal AS journal, k, id(a) AS currauthorid
MATCH (colla:Author)-[collr:Authors]-&gt;(collp:Paper)-[colls:Uses_keyword]-&gt;(k)
WHERE collp.journal = journal AND id(colla) &lt;&gt; currauthorid
RETURN DISTINCT(colla.name) AS collaborator, a.name AS author,
       count(k) AS keywordmatch, id(a) AS idSource, id(colla) AS idTarget,
       id(k) AS idTarget2
ORDER BY keywordmatch DESC;</preformat>
          <preformat>// Figure 4: community around a given keyword
MATCH (y:Year)&lt;-[r1:Published_In]-(p:Paper)-[r:Uses_keyword]-&gt;(k:Keyword)
WHERE k.keyword = "{0}"
RETURN id(k) AS idSource, id(p) AS idTarget, id(y) AS idTarget2,
       p.journal AS journal, k.keyword AS keyword, y.year AS year;</preformat>
The data was extracted in CSV format in the pre-processing phase and was
loaded into Neo4j. Before the data was loaded, a number of constraints were
created to speed up the loading. The load command was set to commit every
2500 records to avoid performance issues, as recommended by the Neo4j bulk load
guide. Additional indexes were created after the loading to speed up Cypher
queries. For the Apache Spark database, Scala was used to perform queries
and launch parallel operations. Within ACE, potential collaborators of an author are
authors who share keywords and a publication journal with that author. Keywords are used to correlate the
authors' areas of interest. The rules used by ACE are the following:
          <list list-type="bullet">
            <list-item><p>Identify the keywords used by a given author;</p></list-item>
            <list-item><p>Find authors that have used the same keywords;</p></list-item>
            <list-item><p>Select only authors that have authored papers in the same journal;</p></list-item>
            <list-item><p>Return the list of authors, ranked by keyword matches.</p></list-item>
          </list>
        </p>
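        <p>The four rules above can be sketched in plain Python over an in-memory list of papers. This is an illustrative approximation of the rules and not a translation of the Cypher query used in ACE; the record layout, the function name and the sample authors are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def suggest_collaborators(papers, author):
    """Rank other authors by the keywords they share with `author` in the same journal.

    papers: list of dicts with keys 'authors' (set), 'keywords' (set), 'journal' (str)
    """
    # Rule 1: keywords used by the given author, remembered per journal.
    own_pairs = {
        (paper["journal"], keyword)
        for paper in papers
        if author in paper["authors"]
        for keyword in paper["keywords"]
    }

    matches = Counter()
    for paper in papers:
        # Rules 2 and 3: same keyword, used in a paper of the same journal.
        shared = sum((paper["journal"], k) in own_pairs for k in paper["keywords"])
        if shared == 0:
            continue
        for candidate in paper["authors"]:
            if candidate != author:
                matches[candidate] += shared

    # Rule 4: authors ranked by the number of keyword matches.
    return matches.most_common()


if __name__ == "__main__":
    # Hypothetical sample data.
    sample = [
        {"authors": {"Ann", "Bob"}, "keywords": {"graphs", "big data"}, "journal": "J1"},
        {"authors": {"Cat"}, "keywords": {"graphs"}, "journal": "J1"},
        {"authors": {"Dan"}, "keywords": {"graphs", "big data"}, "journal": "J2"},
        {"authors": {"Cat", "Eve"}, "keywords": {"big data", "nlp"}, "journal": "J1"},
    ]
    print(suggest_collaborators(sample, "Ann"))
```
]]></preformat>
        <p>As written, the sketch also returns authors who have already co-authored a paper with the given author; filtering them out would be a straightforward extension.</p>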
        <p>The Cypher query shown in Figure 3 finds potential collaborators for
particular authors.</p>
        <p>A group of authors that have a common area of interest is considered to
be a community. A research domain is identified around a given keyword, for
example data mining. Communities have a dynamic nature, as they build up,
remain stable or shrink around topics and journals over time. The Cypher
query displayed in Figure 4 is used to find such communities.</p>
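        <p>The notion of a community that grows or shrinks around a keyword over time can be illustrated with a short Python sketch. It groups the authors of papers that use a given keyword by publication year; the record layout, the function names and the sample data are hypothetical, and the sketch is not the query used in ACE.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def community_by_year(papers, keyword):
    """Authors who used `keyword`, grouped by year of publication."""
    members = defaultdict(set)
    for paper in papers:
        if keyword in paper["keywords"]:
            members[paper["year"]].update(paper["authors"])
    # Return the years in chronological order.
    return {year: members[year] for year in sorted(members)}


def community_sizes(papers, keyword):
    """Number of distinct authors per year, showing growth or decline."""
    return {year: len(authors)
            for year, authors in community_by_year(papers, keyword).items()}


if __name__ == "__main__":
    # Hypothetical sample data.
    sample = [
        {"authors": {"Ann", "Bob"}, "keywords": {"data mining"}, "year": 2010},
        {"authors": {"Cat"}, "keywords": {"data mining", "graphs"}, "year": 2011},
        {"authors": {"Ann", "Dan"}, "keywords": {"data mining"}, "year": 2011},
        {"authors": {"Eve"}, "keywords": {"graphs"}, "year": 2011},
    ]
    print(community_sizes(sample, "data mining"))
```
]]></preformat>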
      </sec>
      <sec id="sec-3-4">
        <title>Visualization</title>
        <p>ACE was designed to present query results graphically using two types of graphs:
a force-directed graph and a force-directed graph with a time slider. User query
results are stored in a CSV file and visualised using the D3<sup>14</sup> visualisation library.
The ACE user can interact with the graph by clicking on the nodes to visualise
more information about the node, as shown in Figures 5-9.
          <fn id="fn14"><label>14</label><p>https://d3js.org/</p></fn></p>
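        <p>As a rough illustration of the hand-over between query results and the visualisation, the sketch below serialises edge rows to CSV text that a force-directed layout could read. The column names source, target and year, as well as the function name, are assumptions made for this example; the actual column layout used by ACE is not specified here.</p>
        <preformat><![CDATA[
```python
import csv
import io


def results_to_csv(rows):
    """Serialise (source, target, year) tuples as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "year"])   # hypothetical column names
    for source, target, year in rows:
        writer.writerow([source, target, year])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical query result: one collaboration edge per row.
    print(results_to_csv([("Ann", "Bob", 2015), ("Ann", "Cat", 2016)]))
```
]]></preformat>
        <p>Keeping the year as a column is what would allow a time slider to filter the edges on the client side.</p>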
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Evaluation</title>
      <p>Apache Spark with GraphFrames utilises DataFrames to store edges and vertices.
The loading process entails loading the contents of the text files into a DataFrame.
The vertex DataFrame must have a numeric column with unique values called id.
The edge DataFrame must have two columns holding the source and destination ids
of the vertices, named src and dst respectively. The PageRank algorithm was
used to traverse the graph and find the most important nodes within the graph.
The results from PageRank are reported in Table 3.</p>
      <p>The ACE front-end allows the user to perform several queries interactively:
        <list list-type="bullet">
          <list-item><p>Query the authors and co-authors that collaborated in each year;</p></list-item>
          <list-item><p>Author collaboration;</p></list-item>
          <list-item><p>Papers that contain a user-given keyword;</p></list-item>
          <list-item><p>Author collaboration suggestion;</p></list-item>
          <list-item><p>The evolution of the community around a user-given keyword.</p></list-item>
        </list>
      </p>
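      <p>To illustrate how a PageRank ranking such as the one reported in Table 3 is obtained from an edge list with source and destination columns, the following sketch runs a plain power iteration in Python. It is not the GraphFrames implementation used in the experiments, whose normalisation and parameters may differ; the function name, the damping factor of 0.85, the number of iterations and the sample edges are assumptions of the sketch.</p>
      <preformat><![CDATA[
```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical directed edges between three vertices.
    ranks = pagerank([("A", "C"), ("B", "C"), ("C", "A")])
    for node in sorted(ranks, key=ranks.get, reverse=True):
        print(node, round(ranks[node], 3))
```
]]></preformat>
      <p>In the sample graph, vertex C receives links from both A and B and therefore ends up with the highest rank, while B, which has no incoming edges, ends up with the lowest.</p>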
      <p>ACE presents the results as a graph that the user can interact with. In the case
of community evolution across time, the user can visualise the
evolution of the communities via a slider. Data extracted from the ACE system was verified
against the online search accessible from the DBLP site.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <table-wrap>
        <label>Table 3</label>
        <caption>
          <p>Top 10 Authors</p>
        </caption>
        <table>
          <tbody>
            <tr><td>Hans Jürgen Schneider</td></tr>
            <tr><td>Jarkko Kari</td></tr>
            <tr><td>Ehsan Khamespanah</td></tr>
            <tr><td>Stefan Szeider</td></tr>
            <tr><td>Richard R. Muntz</td></tr>
            <tr><td>Helmut Alt</td></tr>
            <tr><td>Derek Coleman</td></tr>
            <tr><td>Reiji Nakajima</td></tr>
            <tr><td>Matthieu Perrinel</td></tr>
            <tr><td>Soma Chaudhuri</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The emergence of big data technologies and bibliographic datasets has opened
new possibilities in the research areas of bibliometrics and scientometrics. In
this paper, the DBLP dataset was analyzed to extract communities using the
PageRank algorithm on Apache Spark. These experiments were executed on
server hardware and operating systems. At a later stage, ACE was developed
using the .NET framework and Neo4j as the graph database. ACE is a portable
tool that can be executed on any typical personal computer. Communities and
the evolution of the collaboration network can be analyzed visually. Queries can
be executed and results are processed in a considerably short period of time,
giving the user a truly interactive experience. In ACE, communities are
discovered by traversing the graph according to the input provided by the user.
Apache Spark was used to perform pre-processing and initial analysis. Apart
from PageRank, other community detection algorithms such as Triangle
Counting, Connected Components and the Label Propagation Algorithm can be executed
on Apache Spark. At the outset, ACE was intended to be an online interactive
tool to allow users to explore collaboration patterns. Potential co-authors for
a given author were determined by finding similar authors in the communities
of that author. This implementation design shifted the
implementation focus from Apache Spark to Neo4j. The main drivers for this decision were
the infancy of the .NET connectivity for Apache Spark and integration with
the visualisation part. Further work is required to investigate how ACE can
be transformed into a web application using Apache Spark with automated
visualisation. The graph schema used in graph analysis provides the required
granularity to fulfill the ACE requirements. A number of changes can be made
to improve the results obtained from graph analysis. The schema should be
revised to include edges from the keyword to the paper and from the paper to the
author. The journal publishing the paper is currently an attribute in the paper's
vertex. Extracting journals as separate vertices with the respective edges would
allow computing journal importance. This data can be correlated to determine
whether importance is gained from keywords, authors or both. Currently, ACE
matches author names and keywords using string matching. Analysis on
similarity search would improve system usability. In order to be able to correlate
collaborations ACE should be extended to support multiple keyword searches.
An implementation enhancement that merits further investigation would be the
ability to plugin other datasets using linked data techniques. Enriching ACE
using linked data requires rewriting the pre-processing phases to allow ACE to read
data sources de ned using standard ontologies. A linked data version of ACE
would boost the data available and improve the overall functionality of ACE
namely the author suggestion. Through the linked data approach more context
on the authors involved on the collaboration can be mined. Furthermore, the
topics of collaboration for a given author and the conferences and journals to
which he/she submits research, tend to change over time. Through linked data
the mapping can be preserved and more information can be attained from the
collaboration network.</p>
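      <p>As an illustration of the similarity search suggested above for author names, approximate matching can be sketched with Python's standard difflib module. This is a hedged sketch, not part of ACE: the small author index, the find_authors helper and its limit and cutoff parameters are assumptions chosen for the example, and a production system might prefer a dedicated full-text or phonetic index.</p>
      <preformat>
```python
import difflib

# Hypothetical author index; in ACE the names would come from the DBLP data.
authors = ["Stefan Szeider", "Jarkko Kari", "Helmut Alt", "Derek Coleman"]


def find_authors(query, names, limit=3, cutoff=0.6):
    """Return names similar to the query, best match first, ignoring case."""
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[hit] for hit in hits]


# Misspelled queries still resolve to the intended author.
print(find_authors("Stephan Szeider", authors))
print(find_authors("Jarko Kari", authors))
```
      </preformat>
      <p>Raising the cutoff makes matching stricter, while lowering it tolerates more spelling variation at the cost of more false positives.</p>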
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Acuna</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allesina</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kording</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          ,
          <article-title>Future impact: Predicting scientific success</article-title>
          .
          <source>Nature</source>
          ,
          <volume>489</volume>
          (
          <issue>7415</issue>
          ), (
          <year>2012</year>
          ) pp.
          <fpage>201</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ribeiro-Neto</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <article-title>Modern information retrieval</article-title>
          (Vol.
          <volume>463</volume>
          ). (
          <year>1999</year>
          ) New York: ACM press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barabasi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Albert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Emergence of Scaling in Random Networks</article-title>
          .
          <source>Science</source>
          ,
          <volume>286</volume>
          (
          <issue>5439</issue>
          ), (
          <year>1999</year>
          ) pp.
          <fpage>509</fpage>
          -
          <lpage>512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bar-Ilan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>Which h-index? A comparison of WoS, Scopus and Google Scholar</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>74</volume>
          (
          <issue>2</issue>
          ), (
          <year>2008</year>
          ) pp.
          <fpage>257</fpage>
          -
          <lpage>271</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dickinson</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <article-title>Jena: implementing the semantic web recommendations</article-title>
          .
          <source>In Proceedings of the 13th international World Wide Web conference on Alternate track papers &amp; posters</source>
          , (
          <year>2004</year>
          ) pp.
          <fpage>74</fpage>
          -
          <lpage>83</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Day</surname>
            ,
            <given-names>M. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shih</surname>
            ,
            <given-names>S. P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>W. D.</given-names>
          </string-name>
          ,
          <article-title>Social network analysis of research collaboration in Information Reuse and Integration</article-title>
          .
          <source>IEEE International Conference on Information Reuse &amp; Integration</source>
          , (
          <year>2011</year>
          ) pp.
          <fpage>551</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>De Winter</surname>
            ,
            <given-names>J.C.F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zadpoor</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dodou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <article-title>The expansion of Google Scholar versus Web of Science: a longitudinal study</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>98</volume>
          (
          <issue>2</issue>
          ), (
          <year>2014</year>
          ) pp.
          <fpage>1547</fpage>
          -
          <lpage>1565</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Egghe</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <article-title>Theory and practise of the g-index</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>69</volume>
          (
          <issue>1</issue>
          ), (
          <year>2006</year>
          ) pp.
          <fpage>131</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Egghe</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <article-title>An improvement of the h-index: The g-index</article-title>
          .
          <source>ISSI newsletter</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ), (
          <year>2006</year>
          ) pp.
          <fpage>8</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Franceschet</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <article-title>PageRank: Standing on the Shoulders of Giants</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>54</volume>
          (
          <issue>6</issue>
          ), (
          <year>2011</year>
          ) pp.
          <fpage>92</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Fuyuno</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cyranoski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <article-title>Cash for papers: putting a premium on publication</article-title>
          .
          <source>Nature</source>
          ,
          <volume>441</volume>
          (
          <issue>7095</issue>
          ) (
          <year>2006</year>
          ) pp.
          <fpage>792</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hirsch</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <article-title>An index to quantify an individual's scientific research output</article-title>
          .
          <source>Proceedings of the National academy of Sciences of the United States of America</source>
          , (
          <year>2005</year>
          ) pp.
          <fpage>16569</fpage>
          -
          <lpage>16572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>The Dynamics of Scientific Collaboration Networks in Scientometrics</article-title>
          .
          <source>Collnet Journal of Scientometrics and Information Management</source>
          , (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Osborne</surname>
            ,
            <given-names>F</given-names>
          </string-name>
          ,
          <article-title>Making sense of research with rexplore</article-title>
          .
          <source>Proceedings of the 2012th International Conference on Posters &amp; Demonstrations Track-Volume</source>
          <volume>914</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mutschke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mayr</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>Science models for search: a study on combining scholarly information retrieval and scientometrics</article-title>
          .
          <source>Scientometrics</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Winograd</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <article-title>The PageRank citation ranking: Bringing order to the Web</article-title>
          .
          <source>Proceedings of the 7th International World Wide Web Conference</source>
          , Brisbane, Australia, (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <article-title>Influence of co-authorship networks in the research impact: Ego network analyses from Microsoft Academic Search</article-title>
          .
          <source>Journal of Informetrics</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ), (
          <year>2014</year>
          ) pp.
          <fpage>728</fpage>
          -
          <lpage>737</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ortega</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Aguillo</surname>
            ,
            <given-names>I. F.</given-names>
          </string-name>
          ,
          <article-title>Microsoft academic search and google scholar citations: Comparative analysis of author profiles</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          ,
          <volume>65</volume>
          (
          <issue>6</issue>
          ), (
          <year>2014</year>
          ) pp.
          <fpage>1149</fpage>
          -
          <lpage>1156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Osborne</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mulholland</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>Exploring scholarly data with rexplore</article-title>
          .
          <source>In International semantic web conference</source>
          (
          <year>2013</year>
          ) pp.
          <fpage>460</fpage>
          -
          <lpage>477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Osborne</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <article-title>Understanding research dynamics</article-title>
          .
          <source>In Semantic Web Evaluation Challenge</source>
          (
          <year>2014</year>
          ) pp.
          <fpage>101</fpage>
          -
          <lpage>107</lpage>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Rosenstreich</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wooliscroft</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <article-title>Measuring the impact of accounting journals using Google Scholar and the g-index</article-title>
          .
          <source>The British Accounting Review</source>
          ,
          <volume>41</volume>
          (
          <issue>4</issue>
          ), (
          <year>2009</year>
          ) pp.
          <fpage>227</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Salatino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <article-title>Early Detection and Forecasting of Research Trends</article-title>
          .
          <source>DC@ISWC</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Sukumar</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ferrell</surname>
            ,
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <article-title>Big Data collaboration: Exploring, recording and sharing enterprise knowledge</article-title>
          .
          <source>Information Services &amp; Use</source>
          ,
          <volume>33</volume>
          (
          <issue>3-4</issue>
          ), (
          <year>2013</year>
          ) pp.
          <fpage>257</fpage>
          -
          <lpage>270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <article-title>How we collaborate: characterizing, modeling and predicting scientific collaborations</article-title>
          .
          <source>Scientometrics</source>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>Arnetminer: An expertise oriented search system for web community</article-title>
          .
          <source>In Proceedings of the 2007 International Conference on Semantic Web Challenge-Volume</source>
          <volume>295</volume>
          (
          <year>2007</year>
          ) pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . CEUR-WS. org.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <article-title>Arnetminer: extraction and mining of academic social networks</article-title>
          .
          <source>In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          (
          <year>2008</year>
          ) pp.
          <fpage>990</fpage>
          -
          <lpage>998</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Türker</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cavusoglu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <article-title>How we collaborate: characterizing, modeling and predicting scientific collaborations</article-title>
          .
          <source>Scientometrics</source>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Xian</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <article-title>Anatomy of Scholarly Collaboration in Engineering Education: A Big-Data Bibliometric Analysis</article-title>
          .
          <source>J. Eng. Educ.</source>
          ,
          <volume>103</volume>
          , (
          <year>2014</year>
          ) pp.
          <fpage>486</fpage>
          -
          <lpage>514</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>An evolutionary analysis of collaboration networks in scientometrics</article-title>
          .
          <source>Scientometrics</source>
          ,
          <volume>107</volume>
          (
          <issue>2</issue>
          ), (
          <year>2016</year>
          ) pp.
          <fpage>759</fpage>
          -
          <lpage>772</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>