<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analyzing Email Patterns with Timelines on Researcher Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jangwon Gim</string-name>
          <email>jangwon@kisti.re.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yunji Jang</string-name>
          <email>yunji@kisti.re.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Do-Heon Jeong</string-name>
          <email>heon@kisti.re.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanmin Jung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Institute of Science and Technology Information (KISTI)</institution>
          ,
          <addr-line>245 Daehak-ro, Yuseong-gu, Daejeon (305-806)</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper proposes a procedure for easily extracting a feature that helps differentiate between researchers with similar names in articles. We examined email patterns and their timelines to identify researchers. Our statistical analysis shows that multiple email address usage patterns are found for approximately 43% of researchers, and that about 5% of the patterns overlap. Based on these statistics, we conclude that researcher identification is still required to enhance the performance of researcher-centric analytics systems and applications.</p>
      </abstract>
      <kwd-group>
        <kwd>researcher name disambiguation</kwd>
        <kwd>feature selection</kwd>
        <kwd>researcher data set</kwd>
        <kwd>timeline</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With ever-increasing amounts of research data and advancing technology in
big-data environments, a paradigm shift is required. Accordingly, studies on new
business intelligence services are being conducted, and forecasting and analysis
methods are being developed. Among several analytical methods, prescriptive analytics
first appeared in 2013 and offers diverse strategies for achieving objectives
and improving business competence. The 2014 Gartner Hype Cycle Special Report
predicted that prescriptive analytics will advance rapidly and reach a technology
maturity stage within the next ten years. InSciTe Advisory is a service developed in
2013 for strengthening researchers’ research skills by using the 5W1H method with
prescriptive analytics [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ]. The service analyzes a researcher’s skill set and provides
analytical results by means of the 5W1H method in order to assist the researcher in
attaining the level of a role model group. However, exact diagnosis and analysis of researchers is
required to provide them with an optimum strategy for reaching their research goals.
To achieve this objective, a researcher’s basic information as well as their research data
must be collected completely in order to examine research results and identify fields
of study. Doing so requires that the researcher be properly identified. In practice, however, a
researcher’s research information is often confused with that of other researchers and
retrieved together with it, because researchers’ full or abbreviated names are
similar. Accurately identifying a researcher is thus difficult. If research results are
integrated without accurate identification of the researcher in question, the analysis of this
researcher and his or her studies can lead to either overestimation or underestimation. In this
study, we propose an accurate data acquisition procedure to properly identify
researchers. Our proposed researcher identification method extracts researchers’ email
usage and timeline patterns. The structure of this paper is as follows. We discuss
related studies in Section 2 and describe the feature selection procedure in Section 3.
In Section 4, we present and analyze test data and results. Section 5 concludes our
study and states avenues for our future research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The amount of academic literature published on the World Wide Web is
ever-growing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this environment, researchers spend much time analyzing the
following: rapidly growing research fields, well-known academic literature in
specific research fields, and the authors and works most pertinent to their
own research. Accordingly, services that strengthen researcher competence by
providing researchers with the most relevant and desirable information are being widely
developed. These services help to ensure the accuracy of researcher data and thus a
researcher’s credibility. Proper identification of researcher data is critical, and many
studies are being conducted in this area. Such studies commonly rely on
databases such as DBLP and PubMed, which are
popular sources for reviewing and collecting large quantities of researcher data [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ].
      </p>
      <p>
        Figure 1 shows an example of researchers who may share the same name. The
authors might be split into two different researchers or merged into a single researcher.
Ensuring the accuracy of classification is difficult when the network automatically
classifies Researcher 1 as the same person associated with data collected on specific
research papers. Therefore, methods for automatically identifying researchers are
necessary when large-scale literature data is considered. In addition, a highly accurate
correct answer set and an experimental data set are required when researchers
conduct studies based on researcher data. To achieve this, accurate identification of a
researcher is required for certain works that are part of researcher data. Therefore,
currently operating authentication services such as Elsevier SciVal Experts and
ORCID are designed so that researchers can provide relevant information directly and
manage it by themselves [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]. The accuracy of researcher information is improved
through these services. However, researcher identification remains a problem when
we try to integrate the data provided by these services with previously published data.
      </p>
      <p>
        Therefore, studies on researcher identification based on researcher meta
information extracted from papers and scientific literature data have been proposed.
Studies exist that examine the use of researcher email and affiliation information. The
study in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] examines the email content of specific researchers and extracts names for
identification purposes. The similarity of names is identified by examining the
extracted name and a sentence containing that name. The researchers in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] performed
identification based on email contents, but they did not consider characteristics of
email addresses themselves such as character strings. The study in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] tried to solve
disambiguation problems related to author names on the basis of researcher affiliation
information. To accomplish this, it proposed the pairwise factor graph (PFG) method.
This method generates pairs by randomly combining two papers a researcher has
published and attempts to identify the researcher based on similarity information. In
addition, it examines the distribution of atomic clusters by using the pairs to compare
co-authors with the researcher, affiliation names, and titles of papers. However,
identifying the exact author is difficult when one or more other researchers
have the same name. The Labeling Oriented Author Disambiguation (LOAD) method
uses a machine learning algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In LOAD, data are clustered into High Precision
Clusters (HPCs) and High Recall Clusters (HRCs). The method clusters the meta information
that can be extracted from each paper, including email and affiliation, and
distinguishes different people with the same name by clustering each author’s papers
based on HPCs. Compared with the existing automatic homonym algorithm,
LOAD improves disambiguation accuracy and can save the time a
human needs to label a specific cluster. However, it does not consider the timeline
information of the features. One of the most important factors for identifying a
researcher in researcher data is timeline information [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Email address and
affiliation information of a researcher can be changed, added, or deleted. Therefore, a
researcher’s activity history can be tracked if timeline information is used for
identifying a researcher [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>In this paper, we introduce a procedure for extracting certain features, such as
email address and affiliation name, which can play an important role in identifying a
researcher. Further, we propose an analysis method based on the timeline and report
our experimental results.</p>
    </sec>
    <sec id="sec-3">
      <title>Feature Selection</title>
      <sec id="sec-3-1">
        <title>A procedure for selecting email patterns</title>
        <p>This section explains the procedure for extracting features from researcher data.</p>
        <p>The feature extraction procedure involves four stages, as shown in Figure 2. The first
stage is the collection of researcher data: we collect the meta information of
published papers available on the web in order to disambiguate researcher
names. To do so, we collect the Digital Object Identifiers (DOIs) of the papers and, using
these DOIs, collect the published information from the pertinent sites. Since the
websites on which papers are published are structured differently, we develop
customized crawlers that take the structure of each
web page into consideration. The second stage is the feature extraction stage, in which the email addresses of
researchers are extracted. During this stage, the year in which an email address
appears is extracted together with the address to obtain timeline
information for the extracted email addresses. The third stage is the refining stage,
in which we remove unnecessary data from the pertinent
feature. For example, some scientific papers mention the address and postal code of
an organization in addition to the affiliation name; the same organization may then
be mistaken for different organizations owing to differing addresses or postal codes,
and therefore such unnecessary information is removed. The
last stage is the pattern extraction stage, in which the unique pattern of a
researcher is derived from the extracted pattern information; this unique pattern
can be used to accurately identify a researcher.</p>
        <p>Based on the timeline information, the usage period of an email address can be defined. To estimate this period, the start
and end times of a particular email address are defined as the time of its first usage
and the time of its latest usage (the last appearance of the pertinent email address),
respectively. For example, when multiple email addresses appear for a particular
researcher and the periods during which each email address was used are
mutually exclusive, the usage periods of the email addresses do not overlap,
as in Case 1 of Figure 3. Case 2 in Figure 3 shows that the email addresses of
researchers with the same name differ from each other but their appearance
periods overlap; in this case, it is necessary to determine whether the researchers are the same
or different, because they may actually be different researchers even though
their names are the same.</p>
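        <p>The timeline comparison described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names <monospace>usage_periods</monospace> and <monospace>classify_email_timeline</monospace>, the (email, year) record layout, the example addresses, and the decision to treat two periods that share a boundary year as overlapping are all assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from itertools import combinations


def usage_periods(records):
    """Map each email address to (first_year, last_year) of its appearances.

    `records` is an iterable of (email, year) pairs collected for one
    researcher name (hypothetical layout, assumed for this sketch).
    """
    years = defaultdict(list)
    for email, year in records:
        years[email].append(year)
    return {email: (min(ys), max(ys)) for email, ys in years.items()}


def classify_email_timeline(records):
    """Return 'single', 'disjoint' (Case 1) or 'overlapping' (Case 2)."""
    periods = usage_periods(records)
    if len(periods) < 2:
        return "single"  # fewer than two addresses: nothing to compare
    for (s1, e1), (s2, e2) in combinations(periods.values(), 2):
        # Two closed intervals overlap when each starts no later than
        # the other ends (sharing a boundary year counts as overlap).
        if s1 <= e2 and s2 <= e1:
            return "overlapping"
    return "disjoint"


# Hypothetical example data, not taken from the experimental data set.
case1 = [("a@x.example", 2005), ("a@x.example", 2008),
         ("b@y.example", 2010), ("b@y.example", 2012)]
case2 = [("a@x.example", 2005), ("a@x.example", 2010),
         ("b@y.example", 2008)]

print(classify_email_timeline(case1))  # disjoint
print(classify_email_timeline(case2))  # overlapping
```
]]></preformat>
        <p>In this sketch, a result of "overlapping" only flags a name for further checking; it does not by itself decide whether the records belong to the same researcher or to different researchers.</p>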
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Statistical Analysis</title>
      <sec id="sec-4-1">
        <title>Data set</title>
        <p>Metadata of a paper (title, publication year, co-authors, DOI, etc.),
as well as the researcher’s name, are included in the researcher data provided by
DBLP. However, DBLP does not include the email address and affiliation information of a
researcher. Therefore, we constructed an experimental data set according to the
procedure explained in Section 3.1. As of September 2014, DBLP lists about
1,465,700 researchers and about 4,122,000 papers. To
obtain an experimental data set from these data, we collected website contents
using the DOIs of the papers. The email information is collected automatically by a
crawler, so the email addresses of all authors are collected, and the n-th email
address is not guaranteed to belong to the n-th author. Thus, for our experimental
data we considered only papers for which the
number of authors equals the number of email addresses. As a result, 64,802
researchers were extracted for the experimental data. To extract first authors, we
compared each researcher’s name with the co-author list, located it in the first
position, and removed duplicate names; the resulting experimental data set includes
18,867 researchers. We found that the researchers whose names
were extracted using this process published 3,790 papers, which is
46.64% of the entire experimental data set.</p>
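        <p>The count-based filter described above can be sketched as follows. This is an illustrative sketch, not the crawler code used in this study: the dictionary keys, the DOIs, the names, and the addresses are hypothetical, and pairing authors with email addresses by position is an assumption that the text suggests but does not state explicitly.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def alignable_papers(papers):
    """Keep papers whose number of emails equals the number of authors."""
    kept = []
    for paper in papers:
        authors, emails = paper["authors"], paper["emails"]
        # Skip papers without emails and papers with mismatched counts,
        # since positions cannot be matched reliably in either case.
        if emails and len(authors) == len(emails):
            kept.append(paper)
    return kept


def author_email_pairs(paper):
    """Pair authors and emails by position (an assumption of this sketch)."""
    return list(zip(paper["authors"], paper["emails"]))


# Hypothetical records; the DOIs and addresses are placeholders.
papers = [
    {"doi": "10.0000/example.1", "year": 2012,
     "authors": ["A. Kim", "B. Lee"],
     "emails": ["akim@example.org", "blee@example.org"]},
    {"doi": "10.0000/example.2", "year": 2013,
     "authors": ["A. Kim", "C. Park"],
     "emails": ["akim@example.org"]},
]

kept = alignable_papers(papers)
print([p["doi"] for p in kept])  # ['10.0000/example.1']
print(author_email_pairs(kept[0]))
```
]]></preformat>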
      </sec>
      <sec id="sec-4-2">
        <title>Discussions</title>
        <p>Through the statistical analysis of the results, we found that the proportion of
overlapped email addresses that appear at least twice is
5.28%. In the data set, 3,162 researchers published at least two papers, for a total of
8,126 papers. We set the minimum number of published
papers to two because a researcher with only one published paper can have
only one email address in the data. Further, the emails of 1,371 authors
appeared at least twice in the data set. Among these researchers, a total of 167
showed the pattern depicted on the right side of Figure 4, and they
published 574 papers (7.06% of the entire experimental data set). The proportion for
overlapped email addresses is thus lower, by less than 2 percentage points, than the proportion for the corresponding papers.</p>
        <p>This means that researchers who have overlapped email patterns
tend to be more productive. We also recognized that additional methods are needed to classify them,
because approximately 43% of the researchers have multiple email addresses.</p>
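        <p>The percentages reported in this section can be reproduced from the stated counts with simple arithmetic, as the following sketch shows. The choice of denominators is an assumption on our reading of the text: the 3,162 researchers with at least two papers for the researcher shares, and their 8,126 papers for the paper share. The helper name <monospace>share</monospace> is introduced only for this illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def share(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Counts as reported in the text.
researchers_two_plus_papers = 3162
papers_by_those_researchers = 8126
researchers_with_repeated_emails = 1371
researchers_with_overlap = 167
papers_by_overlap_researchers = 574

overlap_researcher_share = share(researchers_with_overlap,
                                 researchers_two_plus_papers)
overlap_paper_share = share(papers_by_overlap_researchers,
                            papers_by_those_researchers)
repeated_email_share = share(researchers_with_repeated_emails,
                             researchers_two_plus_papers)

print(overlap_researcher_share)  # 5.28
print(overlap_paper_share)       # 7.06
print(repeated_email_share)      # 43.36
# Gap between the paper share and the researcher share, in percentage points.
print(round(overlap_paper_share - overlap_researcher_share, 2))  # 1.78
```
]]></preformat>
        <p>Under these assumed denominators, the computed values agree with the 5.28%, 7.06%, and approximately 43% figures given above.</p>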
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future studies</title>
      <p>In the big data environment, academic data are generated at a very fast rate.
Researchers therefore need to quickly obtain and accurately grasp the information
presented in studies related to their research, the possible co-author network, and the
trends in a specific research field in order to strengthen their research competence.
Accordingly, prescriptive analytics services are required for accurate analysis and for establishing
strategies. However, these services have to be implemented based on accurate data to
establish customized strategies for researchers. To this end, the accurate identification
of researchers’ names and improved credibility of
researcher data become important.</p>
      <p>This paper proposed a procedure for extracting important features from researcher
data to identify ambiguous researcher names. To improve researcher identification,
we defined email address usage patterns by considering the timeline characteristics
of researchers’ email information and carried out experiments based on the DBLP
data set; we verified that an identification method that is based on email addresses and
considers their timeline characteristics is effective and can be used as an important factor
for identifying a researcher. In future work, we will derive a unique pattern
representing a researcher by collecting and extracting affiliation information
together with its timeline characteristics from researcher data, and we will study an
automated researcher identification method that applies the obtained pattern
to identify researchers; to verify its effectiveness, we will build an accurately
refined data set and compare its performance with the experimental data set.</p>
      <p>This work was supported by the IT R&amp;D program of MSIP/KEIT
[2014-044-024002, Developing On-line Open Platform to Provide Local-business Strategy Analysis
and User-targeting Visual Advertisement Materials for Micro-enterprise Managers].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Sa-kwang</given-names>
            <surname>Song</surname>
          </string-name>
          , Jinhyung Kim, Myunggwon Hwang, Jangwon Kim,
          <string-name>
            <given-names>Do-Heon</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Seungwoo</given-names>
            <surname>Lee</surname>
          </string-name>
          , Hanmin Jung, Wonkyung Sung,
          <article-title>"Prescriptive Analytics System for Improving Research Power,"</article-title>
          <source>Proceedings of the 16th International Conference on Computational Science and Engineering (CSE)</source>
          , pp.
          <fpage>1144</fpage>
          -
          <lpage>1145</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Jinhyung</given-names>
            <surname>Kim</surname>
          </string-name>
          , Myunggwon Hwang, Jangwon Gim,
          <string-name>
            <given-names>Sa-Kwang</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Do-Heon</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Seongwoo</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hanmin</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>"Researcher Performance Analysis and Role Model Recommendation Model for Prescriptive Analytics,"</article-title>
          <source>Proceedings of the Korea Computer Congress 2013 (KCC2013)</source>
          , Vol.
          <volume>40</volume>
          , No.
          <issue>2</issue>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>243</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Madian</given-names>
            <surname>Khabsa</surname>
          </string-name>
          , Clyde Lee Giles,
          <article-title>"The Number of Scholarly Documents on the Public Web,"</article-title>
          <source>PLoS One</source>
          , Vol.
          <volume>9</volume>
          , No.
          <issue>5</issue>
          ,
          <year>2014</year>
          (DOI: 10.1371/journal.pone.0093949).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ley</surname>
            ,
            <given-names>Michael.</given-names>
          </string-name>
          <article-title>"The DBLP computer science bibliography: Evolution, research issues, perspectives,"</article-title>
          <source>String Processing and Information Retrieval</source>
          . Springer Berlin Heidelberg, pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. PUBMED, http://www.ncbi.nlm.nih.gov/pubmed</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Emily</given-names>
            <surname>Vardell</surname>
          </string-name>
          , Tanya Feddern-Bekcan,
          <string-name>
            <given-names>Mary</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <article-title>"SciVal Experts: a Collaborative Tool,"</article-title>
          <source>Medical Reference Services Quarterly</source>
          , Vol.
          <volume>30</volume>
          , No.
          <issue>3</issue>
          , pp.
          <fpage>283</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. ORCID, http://orcid.org/</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Einat</given-names>
            <surname>Minkov</surname>
          </string-name>
          , William Weston Cohen,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>"Contextual Search and Name Disambiguation in Email using Graphs,"</article-title>
          <source>Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , Vol.
          <volume>29</volume>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Xuezhi</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jie</given-names>
            <surname>Tang</surname>
          </string-name>
          , Hong Cheng, Philip S. Yu,
          <article-title>"Adana: Active name disambiguation,"</article-title>
          <source>Proceedings of the 11th International Conference on Data Mining (ICDM)</source>
          , pp.
          <fpage>794</fpage>
          -
          <lpage>803</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Yanan</given-names>
            <surname>Qian</surname>
          </string-name>
          , Yunhua Hu, Jianling Cui, Qinghua Zheng, Zaiqing Nie,
          <article-title>"Combining machine learning and human judgment in author disambiguation,"</article-title>
          <source>Proceedings of the 20th ACM international conference on Information and knowledge management</source>
          , pp.
          <fpage>1241</fpage>
          -
          <lpage>1246</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Pei</given-names>
            <surname>Li</surname>
          </string-name>
          , Xin Luna Dong, Andrea Maurino, Divesh Srivastava,
          <article-title>"Linking temporal records,"</article-title>
          <source>Proceedings of the VLDB Endowment</source>
          , Vol.
          <volume>4</volume>
          , No.
          <issue>11</issue>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Jangwon</given-names>
            <surname>Gim</surname>
          </string-name>
          , Myunggwon Hwang,
          <string-name>
            <given-names>Sa-Kwang</given-names>
            <surname>Song</surname>
          </string-name>
          , Jinhyung Kim,
          <string-name>
            <given-names>Do-Heon</given-names>
            <surname>Jeong</surname>
          </string-name>
          , Hanmin Jung,
          <article-title>"Researcher History Tracking Service for Prescriptive Analytics based on Researcher Activities,"</article-title>
          <source>Journal of KIISE: Computing Practices and Letters</source>
          , Vol.
          <volume>20</volume>
          , No.
          <issue>6</issue>
          , pp.
          <fpage>0359</fpage>
          -
          <lpage>0363</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>