=Paper= {{Paper |id=Vol-1888/paper5 |storemode=property |title=Identifying Collaborations among Researchers: a pattern-based approach |pdfUrl=https://ceur-ws.org/Vol-1888/paper5.pdf |volume=Vol-1888 |authors=Luca Cagliero,Paolo Garza,Mohammad Reza Kavoosifar,Elena Baralis |dblpUrl=https://dblp.org/rec/conf/sigir/CaglieroGKB17 }} ==Identifying Collaborations among Researchers: a pattern-based approach== https://ceur-ws.org/Vol-1888/paper5.pdf
    Identifying collaborations among researchers:
              a pattern-based approach

 Luca Cagliero, Paolo Garza, Mohammad Reza Kavoosifar, and Elena Baralis

  Politecnico di Torino - Dipartimento di Automatica e Informatica - Torino, Italy
{luca.cagliero,paolo.garza,mohammadreza.kavoosifar,elena.baralis}@polito.it



      Abstract. In recent years a huge amount of publications and scien-
      tific reports has become available through digital libraries and online
      databases. Digital libraries commonly provide advanced search inter-
      faces, through which researchers can find and explore the most related
      scientific studies. Even though the publications of a single author can be
      easily retrieved and explored, understanding how authors have collaborated
      with each other on specific research topics and to what extent their
      collaborations have been fruitful is, in general, a challenging task.
      This paper proposes a new pattern-based approach to analyzing the correlations
      among the authors of the most influential research studies. To this
      purpose, it analyzes publication data retrieved from digital libraries and
      online databases by means of an itemset-based data mining algorithm. It
      automatically extracts patterns representing the most relevant collabo-
      rations among authors on specific research topics. Patterns are evaluated
      and ranked according to the number of citations received by the corre-
      sponding publications.
      The proposed approach was validated in a real case study, i.e., the anal-
      ysis of scientific literature on genomics. Specifically, we first analyzed
      scientific studies on genomics acquired from the OMIM database to dis-
      cover correlations between authors and genes or genetic disorders. Then,
      the reliability of the discovered patterns was assessed using the PubMed
      search engine. The results show that, for the majority of the mined pat-
      terns, the most influential (top ranked) studies retrieved by performing
      author-driven PubMed queries range over the same gene/genetic disorder
      indicated by the top ranked pattern.

      Keywords. Weighted itemset mining, data mining and knowledge dis-
      covery


1    Introduction
Plenty of scientific studies have been published in scientific journals, books, and
conference proceedings. To deepen their knowledge on specific research topics
researchers commonly explore influential studies written by the most renowned
authors. Digital libraries and online databases (e.g., PubMed [12], OMIM [4])
play a fundamental role in supporting researchers in their studies. For example,
topic- and author-driven searches are supported by most of the renowned digital
libraries. Typically, searches are manually performed to retrieve the publications
of interest.
    In the literature, many studies have analyzed the correlation between the authors
of scientific papers and the topics covered by the scientific literature (e.g., [3, 7,
15, 19]). For example, existing approaches allow us to identify the most relevant
publications of an author on a given topic. An established way to measure the
relevance of a publication in the research community is to count the number
of received citations [11]. Hence, a large body of work addressed the problem
of identifying the most influential research works covering a given topic by means
of citation content analysis [3, 7, 19]. Citations not only indicate the relevance
of a publication but can be exploited also to assess the influence/reputation of
individual researchers in their community [13].
    Since in several domains research works are often the fruit of joint efforts of
many researchers, a parallel research issue is the study of the effectiveness of
the research collaborations among multiple authors. Manually identifying the
most fruitful collaborations on a given research topic is, in general, a challenging
task, because it requires correlating the contribution of multiple authors on a
specific topic by evaluating the significance of their joint research studies with
respect to the existing literature. Hence, automated solutions aimed at analyzing
publication data and automatically discovering fruitful research collaborations
would be desirable.
    To address the aforesaid issue, we propose an automated data mining strat-
egy based on the analysis of publication data acquired from digital libraries
and online databases. The aim is to identify interesting correlations between the
authors of the most influential research studies on specific research topics. To analyze
publication data we apply a variant of an FP-Growth-like weighted itemset
mining algorithm [2]. Weighted itemset mining is an established data mining
technique that focuses on discovering recurrent combinations of items, charac-
terized by different importance levels, from transactional data [14, 16, 18]. In our
context, items represent authors or research topics. We discover a new type of
pattern, namely the Authors-Topic Pattern (ATP), which represents combina-
tions of authors and topics that frequently co-occur in the analyzed data (i.e.,
they are associated with a large number of publications). To consider also the
impact of the research collaboration on the research community each publica-
tion in the source data is enriched with the current number of received citations.
Then, pattern relevance is measured as a weighted frequency of occurrence in
the analyzed data (hereafter denoted as influence). To pinpoint, for each collaboration
among multiple authors, the research topics that have produced the most
authoritative studies, the corresponding patterns are ranked by decreasing influence.
Since the extracted patterns are easily interpretable, users may readily
explore the top ranked patterns generated by the automatic data mining process.
    We experimentally evaluated the effectiveness of the proposed methodology
in a real case study, i.e., the analysis of scientific studies on genomics and genet-
ics acquired from two independent libraries (i.e., OMIM [4] and PubMed [12]).
In the context under analysis, the ATPs mined from OMIM data represent com-
binations of researchers working together on a specific genetic disorder or gene.
We assessed the reliability of the discovered patterns by exploiting the search
engine of the PubMed library. Specifically, for each ATP mined from OMIM
data we performed an author-driven query on PubMed to find the most related
publications co-authored by the same researchers indicated in the pattern. The
query returned a ranked list of related publications. The results show that, for
the majority of the top ranked (automatically generated) ATPs, the most in-
fluential (top ranked) studies returned by author-driven PubMed queries range
over the same gene/genetic disorder indicated by the pattern. Hence, ATPs com-
pactly represent salient information about research collaborations on genomics.
The manual retrieval of the same information would entail performing many
PubMed search queries and then combining the results, which can be a non-
trivial and potentially time-consuming task.
    The rest of the paper is organized as follows. Section 2 compares the proposed
approach with existing studies. Section 3 thoroughly describes the proposed
methodology, while Section 4 experimentally evaluates its effectiveness on real
data. Finally, Section 5 draws conclusions and discusses future developments of
the proposed work.


2   Related work

Studying the impact of scientists’ research based on the citations received by
their academic publications is a known research problem (e.g., [3, 7, 19]). Cita-
tion content analysis is a common way to tackle this problem. It focuses on
analyzing the semantics, syntax, and position in the text of the paper of the
citations to reveal the influence of both authors and scientific papers. For exam-
ple, in [7] the authors analyzed the sentences including citation expressions to
identify interesting characteristics of scholarly communication. The works pre-
sented in [3, 19] classified citations based on their semantics to gain insights
into the relationships between authors and topics. A social network of academic
researchers has been proposed in [15]. The authors automatically extracted re-
searcher profiles from the Web and integrated publication data into the network
from existing digital libraries. The proposed academic search system, called Ar-
netMiner, adopted a predefined Author-Topic model to relate major research
topics to the most influential authors. However, to the best of our knowledge, it cannot
be trivially adopted to automatically identify fruitful collaborations among
multiple authors on a specific topic. Therefore, the goal of this work is comple-
mentary to the above-mentioned approaches.
    To assist editors in the peer review of scientific papers a significant research
effort has been devoted to proposing new methodologies for matching paper
topics with researchers’ expertise. For example, in [8–10] the authors addressed
the problem of choosing a pool of reviewers for a given article based on the
expertise of a potentially large set of candidate reviewers and on the main topics
covered by the paper under review. They tackled the optimization problem to
assign each paper to at least three independent reviewers with complementary
expertise so that the pool of reviewers assigned to each paper covers most of
the major topics of the paper and each candidate reviewer has a reasonable
number of reviews to do. Conversely, the problem addressed in this paper is not
an optimization issue. Even though discovering fruitful collaborations among
authors of scientific studies may help conference chairs and journal editors to
effectively plan reviewer assignments, the patterns extracted by our methodology
are general and they have not been specifically designed to address the reviewer
assignment problem.
    A parallel research effort has been devoted to efficiently extracting itemsets
and association rules from weighted data [2, 14, 16, 18]. This problem extends the
traditional association rule mining task, which was first introduced in [1] in the
context of market basket analysis, to the case in which data items are no longer
considered as equally relevant within the analyzed data. For example, in the
context of market basket analysis the goal is to find sets of products frequently
purchased together by taking into account not only the list of products that
customers have put into their market basket but also the purchased amount
and unit price of each purchased product. In [18] the authors proposed to
extract weighted association rules, i.e., rules that include weights denoting item
significance. In [16] and [2] weights are used to drive the frequent
and infrequent itemset mining processes, respectively, while in [14] weights are
automatically generated by means of graph indexing techniques. This paper
applies a variant of a weighted itemset mining algorithm [2] to discover a new
type of pattern, which represents significant correlations between authors and
research topics.

3     Scientific Collaboration Analyzer
Scientific Collaboration Analyzer (SCA) is a new data mining-oriented method-
ology to analyze the scientific literature accessible through digital libraries and
online databases. The goal is to identify sets of authors (of arbitrary size) whose
joint research on specific topics has produced publications with significantly high
impact. The methodology consists of two main steps: (i) Data collection and
preparation. Publication data and citations are acquired from online sources,
collected into a single repository, and tailored to the subsequent mining process (see
Section 3.1). (ii) Pattern discovery, evaluation, and ranking. Patterns that rep-
resent combinations of authors and topics are extracted from the prepared data
and ranked according to ad hoc evaluation metrics (see Section 3.2). A more
thorough description of each step follows.

3.1    Data preparation
Publication data are acquired from digital libraries and online databases and
stored in a single repository. For our purposes, for each publication we acquire
the following information: (i) the Digital Object Identifier (DOI) of the publica-
tion, (ii) the list of authors, (iii) the list of the research topics that are mainly
discussed in the publication, and (iv) the number of citations received.
                Table 1. Example of weighted transactional dataset

   Pub. id  #cit.  Authors                                     Topics
      1      10    (Author: Brown, J.), (Author: Smith, L.)    (Topic: A), (Topic: X)
      2       5    (Author: Brown, J.), (Author: Smith, L.)    (Topic: D), (Topic: X)
      3      10    (Author: Brown, J.), (Author: Smith, L.)    (Topic: C), (Topic: Z)
      4       1    (Author: Smith, L.)                         (Topic: X), (Topic: Z)
      5      10    (Author: Brown, J.), (Author: Smith, L.)    (Topic: C), (Topic: X)
      6      12    (Author: Smith, L.)                         (Topic: Z)



Table 2. Patterns ranked by decreasing influence. Minimum influence threshold
mininf = 20

   Authors-Topic Pattern                                        Influence
   {(Author: Brown, J.), (Author: Smith, L.), (Topic: X)}          25
   {(Author: Smith, L.), (Topic: Z)}                               23




    The current number of citations is considered because it has largely been
used to assess the influence/popularity of a publication in the research commu-
nity [11]. However, since the proposed methodology is general, different measures
can be easily integrated as well (e.g., the Hirsch index [6]).
    To allow pattern mining, publication data are collected into a weighted transactional
dataset. A weighted transactional dataset is a set of pairs ⟨transaction,
weight⟩, where each transaction corresponds to a different scientific publication,
while the weight, which denotes the relevance of the publication, is the number of
citations it has received.
    Transactions consist of sets of items, where items are publication authors
(e.g., Smith, L.), or research topics (e.g., topic X). Items are represented in
the form (feature:value), where feature is Author or Topic, while value is the
corresponding feature value.
    A more formal definition of weighted transactional dataset is given below.

Definition 1. Weighted transactional dataset. Let A be the set of authors
and T be the set of topics. Let P be the set of all scientific publications and let
C(p_i) (p_i ∈ P) be the number of citations received by publication p_i. An item
i_k is a pair feature:v_q, where v_q ∈ A if feature is equal to Author or v_q ∈ T if
feature is equal to Topic. A transaction t_j is a set of items related to publication
p_j. A weighted transactional dataset D is a set of weighted transactions, where
each weighted transaction tw_j ∈ D corresponds to a different publication p_j ∈ P
and consists of a pair ⟨t_j, C(p_j)⟩.

    For instance, Table 1 reports an example dataset consisting of six weighted
transactions, each one corresponding to a different scientific publication. Each
publication, identified by the respective id, is weighted by the corresponding
number of citations (see Column # cit.). For each publication the list of authors
(see Column Authors) and the covered topics (see Column Topics) are known.
Publications can be co-authored, and can be related to many topics. For example,
the publication with pub. id 1 received 10 citations (i.e., its transaction weight is
10). Its corresponding transaction consists of the following items: (Author:
Brown, J.), (Author: Smith, L.), (Topic: A), and (Topic: X). The transaction
refers to a publication that was co-authored by Brown J. and Smith L. and that
relates to topics A and X.
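
As an illustration of Definition 1, the dataset of Table 1 can be encoded as follows. This is a minimal sketch and not the authors' implementation (their data preparation is written in Java); the names WeightedTransaction, make_transaction, and DATASET are assumptions introduced only for this example.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

# An item is a (feature, value) pair, e.g. ("Author", "Brown, J.").
Item = Tuple[str, str]


class WeightedTransaction(NamedTuple):
    """One publication: its items plus the citation count used as weight."""
    pub_id: int
    items: FrozenSet[Item]
    citations: int


def make_transaction(pub_id: int, authors: List[str], topics: List[str],
                     citations: int) -> WeightedTransaction:
    """Build a weighted transaction from author and topic lists (hypothetical helper)."""
    items = frozenset([("Author", a) for a in authors] +
                      [("Topic", t) for t in topics])
    return WeightedTransaction(pub_id, items, citations)


BROWN, SMITH = "Brown, J.", "Smith, L."

# The six publications of Table 1.
DATASET = [
    make_transaction(1, [BROWN, SMITH], ["A", "X"], 10),
    make_transaction(2, [BROWN, SMITH], ["D", "X"], 5),
    make_transaction(3, [BROWN, SMITH], ["C", "Z"], 10),
    make_transaction(4, [SMITH], ["X", "Z"], 1),
    make_transaction(5, [BROWN, SMITH], ["C", "X"], 10),
    make_transaction(6, [SMITH], ["Z"], 12),
]

if __name__ == "__main__":
    first = DATASET[0]
    print(first.pub_id, first.citations, sorted(first.items))
```

Storing the items as a frozenset makes the containment test I ⊆ t_j used in the later definitions a single subset comparison.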

3.2   Pattern discovery, evaluation, and ranking
This step entails discovering a new type of pattern from the prepared weighted
transactional dataset, namely the Authors-Topic Pattern (ATP). It represents a
potentially interesting correlation between a set of authors and a research topic.
    Preliminaries. Pattern extraction relies on itemset mining techniques. Fre-
quent itemset mining [1] is an established data mining technique to discover
recurrent correlations among data items hidden in large datasets. A k-itemset is
a set of k distinct items in a transactional dataset. It indicates the co-occurrence
of the corresponding items in the analyzed dataset. In our context of analysis,
an item represents either an author or a topic (see Definition 1). Hence, item-
sets may represent co-occurrences of multiple authors and topics in the analyzed
dataset. A more formal definition of itemset is given below.
Definition 2. Itemset. Let D be a weighted transactional dataset and let I
be the set of distinct items in the form feature:v_q contained in any weighted
transaction tw_j ∈ D. A k-itemset (i.e., an itemset of length k) is a set of k
distinct items in I.
    Note that each itemset may contain an arbitrary number of items belonging
to any feature.
    Since generating all the possible itemsets is computationally intractable even
on medium-size datasets, itemset mining is commonly driven by a minimum sup-
port threshold [1]. More specifically, frequent itemset mining entails extracting
all the itemsets that frequently occur in the source dataset D, i.e., all itemsets
whose frequency of occurrence (support) in the source dataset is above a given
threshold minsup. The support threshold prevents the extraction of less relevant
or misleading itemsets. Thus, it allows us to consider only the most recurrent
and thus potentially reliable patterns.
    For example, itemset {(Author: Brown, J.), (Topic: X)} occurs three times
in the dataset in Table 1 (publications with ids 1, 2, and 5). Hence, by enforcing a
minimum support threshold minsup=2 the itemset would be extracted because
its frequency of occurrence (3) is above the minimum (user-provided) threshold.
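
The support count in this example can be checked with a few lines of Python. The sketch below is illustrative only: the helper names (topic, support, is_frequent) are assumptions, and the threshold is applied as "greater than or equal to", which is one possible reading of "above a given threshold".

```python
# Table 1 as (items, citations) pairs; items are (feature, value) tuples.
B, S = ("Author", "Brown, J."), ("Author", "Smith, L.")


def topic(name):
    """Hypothetical shorthand for a Topic item."""
    return ("Topic", name)


DATASET = [
    (frozenset({B, S, topic("A"), topic("X")}), 10),
    (frozenset({B, S, topic("D"), topic("X")}), 5),
    (frozenset({B, S, topic("C"), topic("Z")}), 10),
    (frozenset({S, topic("X"), topic("Z")}), 1),
    (frozenset({B, S, topic("C"), topic("X")}), 10),
    (frozenset({S, topic("Z")}), 12),
]


def support(itemset, dataset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items, _ in dataset if itemset <= items)


def is_frequent(itemset, dataset, minsup):
    """Assumed threshold semantics: keep the itemset when support >= minsup."""
    return support(itemset, dataset) >= minsup


if __name__ == "__main__":
    brown_x = frozenset({B, topic("X")})
    print(support(brown_x, DATASET), is_frequent(brown_x, DATASET, minsup=2))
```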
    Unfortunately, the number of frequent itemsets can be very large. To prevent
the generation of redundant patterns, thus simplifying the manual inspection
of the result, a more compact subset of frequent itemsets, called the closed
itemsets [17], can be extracted. An itemset is closed if there exists no superset
that has the same support as this original itemset.
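
One common way to test closedness, sketched below under our own naming (closure and is_closed are hypothetical helpers), is to intersect all transactions that contain the itemset: the itemset is closed exactly when this intersection adds no further item. On the data of Table 1, {(Author: Brown, J.), (Topic: X)} is not closed, because every publication that contains it also lists Smith, L. as an author.

```python
B, S = ("Author", "Brown, J."), ("Author", "Smith, L.")
TA, TC, TD, TX, TZ = (("Topic", t) for t in "ACDXZ")

# Item sets of the six publications in Table 1 (weights are not needed here).
TRANSACTIONS = [
    frozenset({B, S, TA, TX}),
    frozenset({B, S, TD, TX}),
    frozenset({B, S, TC, TZ}),
    frozenset({S, TX, TZ}),
    frozenset({B, S, TC, TX}),
    frozenset({S, TZ}),
]


def closure(itemset, transactions):
    """Items shared by all transactions that contain `itemset`."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return frozenset(itemset)
    return frozenset.intersection(*covering)


def is_closed(itemset, transactions):
    """Closed means no proper superset has the same support."""
    return closure(itemset, transactions) == frozenset(itemset)


if __name__ == "__main__":
    print(is_closed(frozenset({B, TX}), TRANSACTIONS))
    print(is_closed(frozenset({B, S, TX}), TRANSACTIONS))
```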
    Pattern definition. For our purposes, we are interested in mining a particular
type of closed itemset: the combinations of authors with a specific research
topic. Hereafter, we will denote them as Authors-Topic Patterns (ATPs).
Definition 3. Authors-Topic Pattern (ATP). Let I be a closed k-itemset.
I is an ATP if (i) it contains one or more items feature:v_q such that feature=Author,
and (ii) it contains exactly one item such that feature=Topic.
    Recalling the running example, Table 2 reports some examples of ATPs mined
from the dataset in Table 1. For example, {(Author: Brown, J.), (Author:
Smith, L.), (Topic: X)} indicates that researchers Brown J. and Smith L. have
co-authored many research publications on topic X.
    Pattern evaluation. The support quality index of an itemset [1] does not
consider the relative importance of each transaction in the source dataset. More
specifically, in our context of analysis, each publication may have a different
impact on the research community. Some publications can be highly influential,
whereas others may have a limited scope. Hence, to evaluate pattern significance,
pattern occurrences in each publication are weighted according to its impact on
the research community.
    Since our goal is to generate only the combinations of authors and topic
that have achieved a high impact, we extended the standard itemset mining
problem by integrating item weights [18]. Specifically, item occurrences within
each transaction (publication) are weighted by the corresponding number of
citations. Therefore, the co-authorship of publications with a large number of
citations is rewarded, whereas the co-authorship of publications with few citations
is penalized. To formalize this step, we introduce the concept of influence of
an ATP as a weighted frequency of occurrence of the itemset in the weighted
transactional dataset.
Definition 4. ATP influence. Let D be a weighted transactional dataset and
I be an ATP. Let tw_j : ⟨t_j, C(p_j)⟩ be an arbitrary weighted transaction in D.
The influence of ATP I in D, hereafter denoted by inf(I), is defined as follows:

\[
  \mathrm{inf}(I) = \sum_{tw_j \in D \,\mid\, I \subseteq t_j} C(p_j)
\]

    Recalling the previous example, ATP {(Author: Brown, J.), (Author:
Smith, L.), (Topic: X)} has an influence equal to 25 because it covers
the weighted transactions with publication ids 1 (weight 10), 2 (weight 5), and
5 (weight 10), i.e., 10 + 5 + 10 = 25.
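
The influence values reported in Table 2 can be reproduced directly from Definition 4. The following sketch is illustrative only; the function name influence and the data layout are assumptions made for this example.

```python
B, S = ("Author", "Brown, J."), ("Author", "Smith, L.")
TA, TC, TD, TX, TZ = (("Topic", t) for t in "ACDXZ")

# Table 1 as (items, citations) pairs.
DATASET = [
    (frozenset({B, S, TA, TX}), 10),
    (frozenset({B, S, TD, TX}), 5),
    (frozenset({B, S, TC, TZ}), 10),
    (frozenset({S, TX, TZ}), 1),
    (frozenset({B, S, TC, TX}), 10),
    (frozenset({S, TZ}), 12),
]


def influence(pattern, dataset):
    """Sum of the citation counts of all transactions that contain `pattern`."""
    return sum(citations for items, citations in dataset if pattern <= items)


if __name__ == "__main__":
    print(influence(frozenset({B, S, TX}), DATASET))  # publications 1, 2, 5
    print(influence(frozenset({S, TZ}), DATASET))     # publications 3, 4, 6
```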
    Pattern filtering and ranking. To filter out less interesting patterns a
minimum influence threshold is enforced. Specifically, given a weighted trans-
actional dataset and a user-specified minimum influence threshold mininf , we
extract all the ATPs whose influence value is above or equal to the threshold.
Patterns can be sorted by decreasing influence to quickly retrieve the most rel-
evant ones.
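
The whole discovery, filtering, and ranking step can be sketched on the toy data of Table 1 with a naive enumeration. This is not the FP-Growth-like algorithm adopted in this paper: it enumerates every author subset of every publication and is therefore only viable on very small inputs. The function names are assumptions. Under a literal reading of Definitions 3 and 4, with mininf = 20 the enumeration returns the two patterns of Table 2 together with two further patterns that also satisfy the definitions.

```python
from itertools import combinations

B, S = ("Author", "Brown, J."), ("Author", "Smith, L.")
TA, TC, TD, TX, TZ = (("Topic", t) for t in "ACDXZ")

# Table 1 as (items, citations) pairs.
DATASET = [
    (frozenset({B, S, TA, TX}), 10),
    (frozenset({B, S, TD, TX}), 5),
    (frozenset({B, S, TC, TZ}), 10),
    (frozenset({S, TX, TZ}), 1),
    (frozenset({B, S, TC, TX}), 10),
    (frozenset({S, TZ}), 12),
]


def influence(pattern, dataset):
    """Weighted frequency of `pattern` (Definition 4)."""
    return sum(w for items, w in dataset if pattern <= items)


def closure(pattern, dataset):
    """Items shared by all transactions that contain `pattern`."""
    covering = [items for items, _ in dataset if pattern <= items]
    return frozenset.intersection(*covering) if covering else frozenset(pattern)


def mine_atps(dataset, mininf):
    """Naive ATP miner: one topic, one or more authors, closed, inf >= mininf."""
    candidates = set()
    for items, _ in dataset:
        authors = sorted(i for i in items if i[0] == "Author")
        topics = [i for i in items if i[0] == "Topic"]
        for size in range(1, len(authors) + 1):
            for team in combinations(authors, size):
                for t in topics:
                    candidates.add(frozenset(team) | {t})
    ranked = [(c, influence(c, dataset)) for c in candidates
              if influence(c, dataset) >= mininf and closure(c, dataset) == c]
    # Decreasing influence; ties broken by item names for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], sorted(pair[0])))
    return ranked


if __name__ == "__main__":
    for pattern, inf in mine_atps(DATASET, mininf=20):
        print(inf, sorted(pattern))
```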

3.3   The extraction algorithm
Many frequent weighted itemset mining algorithms have already been proposed
in the literature (e.g., [2, 14, 16, 18]). To accomplish the ATP mining task from
weighted transactional data, we adopt a variant of the algorithm first proposed
in [2], which performs FP-Growth-like closed itemset mining. FP-Growth [5]
relies on an FP-tree data model, i.e., a compact, tree-based representation of
the original dataset residing in main memory. To efficiently generate ATPs on
top of closed itemsets, we separately extracted closed itemsets for each topic by
recursively visiting the FP-tree structure in a depth-first manner.



4   Case study

We investigated the applicability of the proposed methodology in a real case
study, i.e., the analysis of the research collaborations on genomics and genetics.
To perform our experiments we analyzed publication data that were acquired
from the Online Mendelian Inheritance in Man (OMIM) catalog of genetic dis-
orders [4].
     The Online Mendelian Inheritance in Man (OMIM) database [4] is one of the
most comprehensive and authoritative compendia of human genes and genetic
phenotypes. OMIM is part of the National Center for Biotechnology Information
(NCBI) system of databases [12] and it is freely available on the Web. OMIM
collects information on all known Mendelian disorders and over 12,000 genes.
Specifically, it thoroughly describes the relationships between phenotypes and
genotypes by providing full-text, referenced overviews on genetic disorders. The
database is updated daily and thus its content is continuously evolving over
time. OMIM exposes public Application Programming Interfaces (APIs) for ge-
netic data crawling and download. Specifically, it allows users to acquire the list
of all known disorders and a set of related annotations. Disorder annotations
consist of (i) the set of genes correlated with the disorder, (ii) a list of scien-
tific publications ranging over the disorder (for each publication the complete
bibliographic information is known), (iii) a textual description of the disorder
including references, and (iv) links to other genetics resources.
     Our study is focused on discovering from OMIM data sets of researchers that
have conducted influential studies on genomics or genetics. To tailor OMIM data
to our context of analysis, we considered as topic categories the genetic disorders
and the genes discussed in each publication. Specifically, the weighted transactional
dataset contains items belonging to three different features: Author, Gene,
and Genetic Disorder. To crawl data from the online OMIM database, we ex-
ploited the exposed APIs [4]. Instead, to retrieve the number of citations received
by each publication we exploited the APIs of the PubMed digital library [12].
The integrated dataset, which was obtained by integrating publication data
from OMIM and citation data from PubMed, contains 8825 articles, 34555 au-
thors, 302 disorders, and 1076 genes. The experiments were performed on a 2.67
GHz Intel Xeon workstation with 32 GB of RAM, running Ubuntu Linux 12.04
LTS. The data crawler and the data preparation steps are written in Java, while
the pattern mining algorithm is written in C.
[Figure 1: plot of the number of itemsets (y-axis, 0 to 3000) against the minimum support threshold (x-axis, 0 to 2000)]

                                           Fig. 1. Number of mined patterns

Table 3. Number of mined patterns vs number of authors in the pattern (minimum
influence = 50)

                 num. of co-authors num. of ADPs num. of AGPs
                          2              760          164
                          3              447          375
                          4              286          496
                          5              230          411
                          6              151          319
                          7              117          340
                          8              101          254
                          9               82          194
                         10               69          160




4.1   Pattern characteristics


We analyzed the characteristics of the patterns mined by setting different values
of minimum influence threshold (i.e., the minimum number of received citations).
Figure 1 plots the number of mined ATPs related to genes, hereafter denoted
as Author-Gene Patterns (AGPs) for the sake of brevity, and the number of
mined ATPs related to disorders, denoted as Author-Disorder Patterns (ADPs),
by varying the mininf threshold value.
    By setting the influence threshold we may discover research collaborations
whose production has had a rather different impact in their community. For example,
by setting mininf to 400, ATPs represent research teams whose scientific
studies on a specific gene/disorder have received at least 400 citations. By increasing
mininf, the constraint on the minimum number of citations becomes
more selective. Hence, as expected, the number of mined patterns decreases more
than linearly while increasing the mininf value. The two curves (those related
to AGPs and to ADPs) show similar trends.
    In Table 3 we categorized the extracted ADPs and AGPs according to the
number of authors appearing in each pattern. The reported categorization ap-
proximately indicates the average impact of the research groups with a given
size. Notably, the influence is not proportional to the number of authors. Small-
and medium-size groups (e.g., from 3 to 5 persons) with high research influence
are quite frequent. However, a significant number of larger groups (7-8 persons)
have produced influential studies as well¹. To reduce the bias due to large research
teams, groups of few researchers should be analyzed first. Alternatively,
pattern occurrences can be weighted by the group size beyond the number of
received citations.


4.2    Pattern analysis

We empirically analyzed the strength of the correlations between the research
teams and the topic identified by the pattern. For each topic we considered the
top-5 ranked patterns, mined from OMIM data, in order of decreasing influence.
Each of the selected patterns indicates the most important topic addressed by
the research team. The research questions we would like to address in this section
are the following:
(A) Are the research team and the topic really correlated with each other?
(B) Among the topics addressed by the team, is the topic indicated in the pattern
the most influential one?
    To address the above research questions, we assessed the quality of the mined
patterns by performing author-driven queries on the PubMed digital library.
Specifically, for each pattern we picked the research team (i.e., the author names
occurring in the pattern) and we performed an author-driven query on PubMed.
The PubMed query returns a list of publications ranked by decreasing relevance.
If a publication in the top-3 PubMed ranking covers the topic we can conclude
that research team and topic are, to some extent, correlated with each other
(question (A)). If the publication covering the topic is at the top of the PubMed
ranking, we can conclude that the addressed topic is the most influential among
those addressed by the research team (question (B)).
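
The two validation criteria can be stated compactly in code. The sketch below is only an illustration of the decision rule: in our study the check was performed manually, and the function name validate_pattern, its parameters, and the sample ranking are hypothetical (the ranking is not real PubMed output).

```python
from typing import List, Set, Tuple


def validate_pattern(pattern_topic: str,
                     ranked_publication_topics: List[Set[str]],
                     top_k: int = 3) -> Tuple[bool, bool]:
    """Return the answers to questions (A) and (B) for one pattern.

    ranked_publication_topics lists, in ranking order, the topics covered by
    each publication returned by the author-driven query.
    """
    top = ranked_publication_topics[:top_k]
    # (A): some publication among the top-k covers the pattern topic.
    correlated = any(pattern_topic in topics for topics in top)
    # (B): the top ranked publication itself covers the pattern topic.
    most_influential = bool(top) and pattern_topic in top[0]
    return correlated, most_influential


if __name__ == "__main__":
    # Made-up ranked result list for one research team.
    ranking = [{"Y"}, {"X", "Y"}, {"W"}, {"X"}]
    print(validate_pattern("X", ranking))  # (True, False)
```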
    We performed the above-mentioned comparison separately for genes and ge-
netic disorders. Specifically, Table 4 summarizes the results of the manual vali-
dation performed on the top-5 Authors-Disorder Patterns (i.e., the ATPs related
to genetic disorders), while Table 5 reports similar results for the top-5 Authors-Gene
Patterns (i.e., the ATPs related to genes). For each pattern we reported the
title and the Digital Object Identifier of the top ranked publication returned by
PubMed. In most cases, the selected publication matches the topic indicated by
the pattern. To show the correlation between genes/disorders and publication
content, in Column PubMed publication title we highlighted the title keywords
that recall, to some extent, the gene name or the genetic disorder indicated in the
pattern. For example, the top ranked Authors-Gene Pattern in Table 5 concerns
gene AUTS1, whose mutations are strongly correlated with autism.
The title of the top ranked publication returned by PubMed makes explicit reference
to the correlated disorder ("Recurrent de novo mutations implicate novel
genes underlying simplex autism risk").

¹ Note that genomic and genetic studies are likely to be co-authored by many researchers.

                    Table 4. Example of Authors-Disorder Patterns

Authors-Disorder Pattern (influence): {Author:Siddique T., Author:Deng H. X., Disease:AMYOTROPHIC LATERAL SCLEROSIS} (inf = 1828)
PubMed publication title: Inclusions in frontotemporal lobar degeneration with TDP-43 proteinopathy (FTLD-TDP) and amyotrophic lateral sclerosis (ALS), but not FTLD with FUS proteinopathy (FTLD-FUS), have properties of amyloid.
PubMed publication DOI: 10.1007/s00401-013-1089-6

Authors-Disorder Pattern (influence): {Author:Rioux J. D., Author:Silverberg M. S., Disease:INFLAMMATORY BOWEL DISEASE (CROHN DISEASE)} (inf = 1470)
PubMed publication title: Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease.
PubMed publication DOI: 10.1038/nature11582

Authors-Disorder Pattern (influence): {Author:Flaherty K. T., Author:Ribas A., Author:Chapman P. B., Disease:MELANOMA CUTANEOUS MALIGNANT SUSCEPTIBILITY TO} (inf = 1253)
PubMed publication title: Improved survival with vemurafenib in melanoma with BRAF V600E mutation.
PubMed publication DOI: 10.1056/NEJMoa1103782

Authors-Disorder Pattern (influence): {Author:Hamilton S. R., Author:de la Chapelle A., Disease:LYNCH SYNDROME I} (inf = 1116)
PubMed publication title: Revised Bethesda Guidelines for hereditary nonpolyposis colorectal cancer (Lynch syndrome) and microsatellite instability.
PubMed publication DOI: Not available

Authors-Disorder Pattern (influence): {Author:Thornton C. A., Author:Swanson M. S., Disease:MYOTONIC DYSTROPHY} (inf = 889)
PubMed publication title: Splicing biomarkers of disease severity in myotonic dystrophy.
PubMed publication DOI: 10.1002/ana.23992




                      Table 5. Example of Authors-Gene Patterns

| Authors-Gene Pattern (influence) | PubMed publication title | PubMed publication DOI |
|---|---|---|
| {Author:O’Roak B. J., Author:Vives L., GeneSymbols:AUTS1} (inf = 1369) | Recurrent de novo mutations implicate novel genes underlying simplex autism risk. | 10.1038/ncomms6595 |
| {Author:Spiegel A. M., Author:Marx S. J., GP-GeneSymbols:MEN1} (inf = 620) | Recapitulation of pancreatic neuroendocrine tumors in human multiple endocrine neoplasia type I syndrome via Pdx1-directed inactivation of Men1. | 10.1158/0008-5472.CAN-08-3662 |
| {Author:Allen R. P., Author:Earley C. J., GP-GeneSymbols:RLS1} (inf = 219) | Intervening Leg Movements Disrupt PLMS Sequences. | 10.1093/sleep/zsw023 |
| {Author:Berson E. L., Author:Dryja T. P., GP-GeneSymbols:RP} (inf = 553) | Novel mutations in the long isoform of the USH2A gene in patients with Usher syndrome type II or non-syndromic retinitis pigmentosa. | 10.1136/jmg.2009.075143 |
| {Author:Julian B. A., Author:Wyatt R. J., GP-GeneSymbols:IGAN1} (inf = 311) | GWAS for serum galactose-deficient IgA1 implicates critical genes of the O-glycosylation pathway. | 10.1371/journal.pgen.1006609 |

5    Conclusion and future work
This paper presents an itemset-based approach to analyzing publication data and
to discovering fruitful collaborations among researchers. The proposed methodology
generates interpretable patterns that compactly represent the collaborations
among researchers that have produced the most influential studies. The applicability
and effectiveness of the proposed methodology have been experimentally
evaluated in a real case study, i.e., the analysis of the publications related to
genomic and genetic studies. As future work, we aim at (i) testing different
weighting functions (e.g., weighting research group size beyond the number of
citations), (ii) applying the proposed methodology to support reviewer assignment
in the paper peer review process, and (iii) addressing community detection
and organization finding in the context of project applications. For reviewer
assignment in particular, we aim at integrating Authors-Topic associations into
existing optimization-based strategies (e.g., [8, 9]). For example, considering
correlations between multiple authors and a topic would help editors to diversify
assignments across researchers with complementary expertise.
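
To illustrate how alternative weighting functions could be plugged in, the sketch below ranks Authors-Topic patterns on toy data. It rests on several assumptions that are not taken from this section: the influence of a pattern is taken to be the sum of the weights of the publications that contain all of its items, patterns are enumerated by brute force rather than by an itemset mining algorithm, and the records, `pattern_influence`, `citation_weight`, and `size_normalized_weight` are hypothetical names and data.

```python
from collections import defaultdict
from itertools import combinations

# Toy publication records (hypothetical): a set of items plus a citation count.
publications = [
    ({"Author:A", "Author:B", "Disease:X"}, 100),
    ({"Author:A", "Author:B", "Disease:X"}, 50),
    ({"Author:A", "Author:C", "Disease:X"}, 30),
    ({"Author:B", "Disease:Y"}, 10),
]

def citation_weight(items, citations):
    """Weight a publication by its raw citation count."""
    return citations

def size_normalized_weight(items, citations):
    """Assumed alternative: divide the citations by the number of co-authors."""
    n_authors = sum(1 for item in items if item.startswith("Author:"))
    return citations / max(n_authors, 1)

def pattern_influence(publications, weight, min_authors=2):
    """Sum the publication weights over every {authors..., topic} pattern.

    Brute-force enumeration, suitable only for small examples.
    """
    influence = defaultdict(float)
    for items, citations in publications:
        authors = sorted(i for i in items if i.startswith("Author:"))
        topics = sorted(i for i in items if not i.startswith("Author:"))
        w = weight(items, citations)
        for k in range(min_authors, len(authors) + 1):
            for group in combinations(authors, k):
                for topic in topics:
                    influence[frozenset(group) | {topic}] += w
    return dict(influence)

def rank(influence):
    """Return (pattern, influence) pairs sorted by decreasing influence."""
    return sorted(influence.items(), key=lambda kv: kv[1], reverse=True)

for name, fn in [("citations", citation_weight),
                 ("size-normalized", size_normalized_weight)]:
    print(name)
    for pattern, inf in rank(pattern_influence(publications, fn)):
        print("  ", sorted(pattern), "inf =", inf)
```

Swapping the weighting function changes the influence values without touching the enumeration step, which is the property a comparison of weighting functions would rely on.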


References
 1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of
    items in large databases. In ACM SIGMOD 1993, pages 207–216, 1993.
 2. L. Cagliero and P. Garza. Infrequent weighted itemset mining using frequent
    pattern growth. IEEE Trans. Knowl. Data Eng., 26(4):903–915, 2014.
 3. Y. Ding, G. Zhang, T. Chambers, M. Song, X. Wang, and C. Zhai. Content-based
    citation analysis: The next generation of citation analysis. JASIST, 65:1820–1833,
    2014.
 4. A. Hamosh, A. Scott, J. Amberger, D. Valle, and V. McKusick. Online Mendelian
    Inheritance in Man (OMIM). Human Mutation, 15(1):57–61, 2000.
 5. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
    In SIGMOD’00, Dallas, TX, May 2000.
 6. J. E. Hirsch. An index to quantify an individual’s scientific research output that
    takes into account the effect of multiple coauthorship. Scientometrics, 85(3):741–
    754, Dec. 2010.
 7. H. J. Kim, J. An, Y. K. Jeong, and M. Song. Exploring the leading authors
    and journals in major topics by citation sentences and topic modeling. In
    BIRNDL@JCDL, 2016.
 8. N. M. Kou, L. H. U, N. Mamoulis, and Z. Gong. Weighted coverage based reviewer
    assignment. In Proceedings of the 2015 ACM SIGMOD International Conference
    on Management of Data, SIGMOD ’15, pages 2031–2046, New York, NY, USA,
    2015. ACM.
 9. N. M. Kou, L. H. U, N. Mamoulis, Y. Li, Y. Li, and Z. Gong. A topic-based
    reviewer assignment system. Proc. VLDB Endow., 8(12):1852–1855, Aug. 2015.
10. B. Li and Y. T. Hou. The new automated IEEE INFOCOM review assignment system.
    IEEE Network, 30(5):18–24, September 2016.
11. C. Lu, C. Zhang, and S. Ma. How does citing behavior for a scientific article
    change over time?: A preliminary study. In Proceedings of the 78th ASIS&T Annual
    Meeting: Information Science with Impact: Research in and for the Community,
    ASIST ’15, pages 97:1–97:4, Silver Springs, MD, USA, 2015. American Society for
    Information Science.
12. NCBI. National Center for Biotechnology Information Website. Available at
    http://www.ncbi.nlm.nih.gov/. Last access: May 2017.
13. A. Sidiropoulos, D. Katsaros, and Y. Manolopoulos. Generalized Hirsch h-index for
    disclosing latent facts in citation networks. Scientometrics, 72(2):253–280, 2007.
14. K. Sun and F. Bai. Mining weighted association rules without preassigned weights.
    IEEE Transactions on Knowledge and Data Engineering, 20(4):489–495, 2008.
15. J. Tang, J. Zhang, L. Yao, J.-Z. Li, L. Zhang, and Z. Su. Arnetminer: extraction
    and mining of academic social networks. In KDD, 2008.
16. F. Tao, F. Murtagh, and M. Farid. Weighted association rule mining using weighted
    support and significance framework. In Proceedings of the ninth ACM SIGKDD
    international conference on Knowledge discovery and data mining, KDD’03, pages
    661–666, 2003.
17. J. Wang, J. Han, and J. Pei. CLOSET+: searching for the best strategies for mining
    frequent closed itemsets. In L. Getoor, T. E. Senator, P. Domingos, and
    C. Faloutsos, editors, Proceedings of the Ninth ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining, pages 236–245, 2003.
18. W. Wang, J. Yang, and P. S. Yu. Efficient mining of weighted association rules
    (WAR). In Proceedings of the sixth ACM SIGKDD international conference on
    Knowledge discovery and data mining, KDD’00, pages 270–274, 2000.
19. G. Zhang, Y. Ding, and S. Milojevic. Citation content analysis (CCA): A framework
    for syntactic and semantic analysis of citation content. JASIST, 64:1490–1503,
    2013.