<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extending Scientific Literature Search by Including the Author's Writing Style</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andi Rexha</string-name>
          <email>arexha@know-center.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Kröll</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hermann Ziak</string-name>
          <email>hziak@know-center.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Kern</string-name>
          <email>rkern@know-center.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Know-Center GmbH</institution>
          ,
          <addr-line>Inffeldgasse 13, A-8010 Graz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>93</fpage>
      <lpage>100</lpage>
      <abstract>
        <p>Our work is motivated by the idea of extending the retrieval of related scientific literature to cases where relatedness also incorporates the writing style of individual scientific authors. We therefore conducted a pilot study to answer the question whether humans can identify authorship once topical clues have been removed. As a first result, we found that this task is challenging, even for humans. We also found some agreement between the annotators. To gain a better understanding of how humans tackle such a problem, we conducted an exploratory data analysis in which we compared the annotators' decisions against a number of topical and stylometric features. The outcome of our work should help to improve automatic authorship identification algorithms and to shape potential follow-up studies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A PhD student, for instance, who intends to improve her writing could benefit from the
aforementioned search functionality: she could search for papers written similarly to one she
likes and learn writing patterns from them. Hence, we envision new retrieval systems
providing a slider that gives more weight to the preferred dimension of search. Figure 1
shows a possible search scenario, where the user may choose how much emphasis should be
put on the topical information (which we consider the semantics of the text) and how much on
the writing style.</p>
      <p>As a first step towards building such a system, we conducted a pilot study in order to
understand whether humans are capable of distinguishing between writing styles without
having topical information. Though we found some agreement between the annotators,
our findings reveal that this task is a challenging one, even for humans. To learn more about
this problem and its difficulties, we conducted an exploratory data analysis in which we
statistically compared the decisions against a number of topical and stylometric features.
We believe our findings to be valuable – not only for integrating writing style information
into the retrieval process – but also for improving the automatic attribution of authorship. In
addition, we make our dataset publicly available1.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Over the past decades, one can observe an ever-growing amount of scientific output; much to
the joy of research areas such as (i) Bibliometrics, which applies statistics to measure scientific
impact, and (ii) Information Retrieval, which applies natural language processing to make this
valuable body of knowledge accessible.</p>
      <p>
        Both fields benefit from adding semantics to scientific publications. This includes assigning
instances to concepts which are organized and structured in dedicated ontologies. Entity and
relation recognition thus represent a valuable pre-processing step for subsequent search
procedures. Medical entity recognition
        <xref ref-type="bibr" rid="ref1">(cf. Abacha &amp; Zweigenbaum, 2011)</xref>
        seeks to extract
instances from classes such as “Disease”, “Symptom” or “Drug” to enrich the retrieval
process. In bioinformatics,
        <xref ref-type="bibr" rid="ref23">Zweigenbaum et al., 2007</xref>
        identify biological entities, for example,
instances of “Protein”, “DNA” or “Cell Line”, and extract the relations between these entities
as facts or events. Research assistants such as BioRAT or FACTA can then offer added
value by employing this type of semantic information. BioRAT
        <xref ref-type="bibr" rid="ref4">(cf. Corney et al., 2004)</xref>
        is given a query and autonomously finds a set of papers, applies natural language processing to
identify biomedical entities, and highlights the most relevant facts. FACTA
        <xref ref-type="bibr" rid="ref21">(cf. Tsuruoka et al., 2008)</xref>
        searches Medline abstracts with an emphasis on biomedical concepts.
        <xref ref-type="bibr" rid="ref13">Liakata et al. (2012)</xref>
        departed from mere content-level enrichment and focused on the
discourse structure to characterize the knowledge conveyed within the text. For this purpose,
they identified 11 core scientific concepts including “Motivation”, “Result” or “Conclusion”.
In the Partridge system,
        <xref ref-type="bibr" rid="ref17">Ravenscroft et al. (2013)</xref>
        build upon this automated recognition to
categorize articles according to their types, such as Review or Case Study. The
TeamBeam algorithm
        <xref ref-type="bibr" rid="ref12">(cf. Kern et al., 2012)</xref>
        aims to extract an article’s meta-data, such as
the title, journal name and abstract, as well as explicit information about the article's authors.
Implicit information about an author includes her writing style, which reflects, among other
things, the writer’s personality and directly relates to characteristics such as readability, clarity,
and so on. Stylometry is the line of research that focuses on defining features to quantify
an author's writing style
        <xref ref-type="bibr" rid="ref10">(Holmes, 1998)</xref>
        .
        <xref ref-type="bibr" rid="ref2">Bergsma, Post &amp; Yarowsky (2012)</xref>
        used stylometric
features to detect the gender of an author and to distinguish between native and non-native
speakers as well as between conference and workshop papers. Stylometry is, for example, employed
to attribute authorship, i.e. to select the author of a questioned article from a set of candidate
authors (cf.
        <xref ref-type="bibr" rid="ref20">Stamatatos, 2009</xref>
        ;
        <xref ref-type="bibr" rid="ref11">Juola, 2008</xref>
        ).
      </p>
      <p>
        With respect to bibliometrics, citation information has recently been explored to enrich the
retrieval process.
        <xref ref-type="bibr" rid="ref5">Dabrowska &amp; Larsen (2015)</xref>
        extracted citation contexts from citing articles
and used them in the scientific search process. Preliminary results indicated that including
citation contexts had a small but positive impact.
        <xref ref-type="bibr" rid="ref7">Eck &amp; Waltman (2014)</xref>
        introduced
CitNetExplorer, a software tool for analysing and visualizing citation networks, which can
thus be used for citation-based scientific literature retrieval.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setup</title>
      <p>In order to understand whether humans are able to identify authorship once the topical
information has been removed, we conducted a pilot study. In this study, we provided human
annotators with one source and four target textual snippets in different experiments. In the
first experiment, one of the targets is written by the same author as the source and the other
three are written by different authors. We then let the annotators rank the snippets from
the most to the least similar with respect to writing style, asking them to rank as “most
similar” the snippet written by the same author (see Figure 2).</p>
      <p>For the study, we selected data from PubMed2, a free database created by the US National
Library of Medicine. This database holds full-text articles from the biomedical domain
together with a standard XML markup that rigorously annotates the complete content of the
published document. It also contains valuable metadata such as the authors and the journal in
which an article is published. At first, we retrieved documents written by a single author only,
in order to obtain “pure” writing styles. Note that some articles may be written by
ghostwriters or by colleagues helping the authors with their English writing; yet, we believe
that this is a very rare case.
From the previously selected documents, we chose a subset and decided to have the
annotators rank text snippets drawn from the beginning of the introduction section
(we selected the first sentences up to the one ending after the 400th character). The rationale
behind this choice is twofold: i) it gets more and more difficult for the annotator to remain
focused on the task while reading a long text; ii) we hypothesize that the introduction contains
less topical information than other parts of a scientific paper.</p>
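      <p>The snippet-selection rule above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the naive regular-expression sentence splitter and the function name are ours.</p>

```python
import re

def select_snippet(introduction: str, min_chars: int = 400) -> str:
    """Take whole sentences from the start of the introduction, stopping at
    the first sentence that ends after the min_chars-th character."""
    # Naive sentence boundary: a terminal punctuation mark followed by space.
    sentences = re.split(r"(?<=[.!?])\s+", introduction.strip())
    snippet, length = [], 0
    for sentence in sentences:
        snippet.append(sentence)
        length += len(sentence) + 1  # +1 for the joining space
        if length > min_chars:
            break
    return " ".join(snippet)
```

      <p>A production version would plug in a proper sentence segmenter instead of the regular expression.</p>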
      <p>Having selected the text snippets, we designed three experiments (which together we call a
“task”) for each annotator. For each experiment, we present the annotators a source and four
target snippets, subject to different problem settings (see Figure 2):</p>
      <p>In Experiment 1, we present a target snippet written by the same author as the
source as well as three snippets written by different authors.</p>
      <p>In Experiment 2, we provide the annotators with one target snippet written by the
same author as the source, one target from a different author but published
in the same journal as the source, and two targets written by different authors and
published in different journals. This experiment is designed to
capture any correlation of writing style within the same journal,
presumably within the same scientific topic.</p>
      <p>In Experiment 3, we want to gain as much information as possible about the
annotators’ reasoning while they rank by similarity. Thus, we show four target
snippets written by authors different from that of the source snippet, while still
suggesting to the annotator that one of the targets is written by the same author as
the source.</p>
      <p>To conduct an exploratory data analysis, we presented the same set of experiments (“task”) to
three different annotators (see Figure 2). In the last design step of our pilot study, we selected
90 random snippets from the PubMed database as candidate source snippets. We indexed the
database of single-author snippets by stemming the words and removing the
stopwords. We assigned 30 snippets to each of the experiment categories, and for each of them
we performed a search according to the following scheme:
• In Experiment 1, we searched for 10 similar articles from the same author and 100
from different authors.
• In Experiment 2, we searched for 10 similar articles from the same author, 10 from the
same journal but a different author, and 100 from different authors and journals.
• In Experiment 3, we searched for 100 similar articles from different authors.
Based on these results, we computed the cosine similarity between each candidate's word vector
and that of the source snippet, and selected the most similar ones according to the experiment
description. This way, the topical information should be removed as a source of information for
authorship identification. For example, for Experiment 1 we selected the most similar article
from the same author and the three most similar from different authors. Additionally, we applied
a manual check and removed experiments that we assumed to contain topical hints for texts
written by the same author (mainly based on keywords or phrases). At the end of this phase, we
kept 66 experiments (22 per experiment category).
The pilot study was performed using the crowd-sourcing platform CrowdFlower3. The
platform provides a workforce from different countries that helps to label and enrich data. In
the next section, we present the outcome of this study as well as an analysis of the results.
3 https://www.crowdflower.com/ (last accessed Feb. 12th, 2017)</p>
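      <p>The content-based selection step can be sketched as below. This is a minimal illustration, not the authors' implementation: the toy stopword list and the crude suffix stripper stand in for a full stopword list and a real stemmer (e.g. Porter's).</p>

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}  # toy list

def stem(word):
    # Crude suffix stripping; a stand-in for a real stemmer such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def vectorize(text):
    """Term-frequency vector after lowercasing, stopword removal and stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOPWORDS)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def most_similar(source, candidates, n=3):
    """Rank candidate snippets by content similarity to the source snippet."""
    src = vectorize(source)
    return sorted(candidates, key=lambda c: cosine(src, vectorize(c)), reverse=True)[:n]
```

      <p>In practice the candidate pool would be served by an inverted index over the single-author snippets, as described above.</p>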
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The job in CrowdFlower was performed by 56 different annotators from 29 different
countries. Since our goal is a ranking based on writing style, the annotators' level of English
proficiency is not a big concern for this task. To avoid random selection, we configured the
system to disallow annotations submitted in less than 20 seconds.</p>
      <p>At first glance, the annotators show only small agreement in ranking the similarity between
source and target snippets. Without considering the rank itself, full agreement was achieved
for 26 targets, 160 targets have an agreement of two annotators, and 78 targets have
no agreement at all. For a more detailed analysis, we used Krippendorff’s alpha to
determine the inter-rater agreement for the ranking of each target. This was computed using
the library “DKPro Statistics”4. The results show:
• an inter-rater agreement of 0.299
• an observed disagreement of 0.699
• an expected disagreement of 0.999
We continue to explore the annotators’ rankings by considering the snippets written by the same
author and those written within the same journal (but by different authors). Table 1 shows the
number of times annotators assigned each similarity rank to these categories.</p>
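      <p>Krippendorff's alpha relates the three reported numbers as alpha = 1 - D_o/D_e (here 1 - 0.699/0.999 ≈ 0.30, matching the reported 0.299 up to rounding). The following is a compact nominal-data version of the coefficient, a simplified stand-in for the DKPro Statistics routine used in the study:</p>

```python
from itertools import combinations

def krippendorff_alpha_nominal(units):
    """units: one list of ratings per item (ratings from different annotators).
    Returns alpha = 1 - observed/expected disagreement for nominal data."""
    units = [u for u in units if len(u) >= 2]  # need at least two ratings
    n = sum(len(u) for u in units)             # total pairable ratings
    # Observed disagreement: mismatching pairs within each item.
    d_obs = sum(
        2 * sum(a != b for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: mismatching pairs across all ratings pooled.
    counts = {}
    for u in units:
        for r in u:
            counts[r] = counts.get(r, 0) + 1
    d_exp = sum(
        counts[c] * counts[k] for c in counts for k in counts if c != k
    ) / (n * (n - 1))
    return 1.0 - d_obs / d_exp if d_exp else 1.0
```
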
      <sec id="sec-4-4">
        <title>Ranking Results</title>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption>
            <p>Number of times annotators assigned each rank to the same-author and same-journal snippets.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Snippet/Ranking</th>
                <th>Most Similar</th>
                <th>Similar</th>
                <th>Less Similar</th>
                <th>Least Similar</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Same Author</td>
                <td>25</td>
                <td>60</td>
                <td>31</td>
                <td>16</td>
              </tr>
              <tr>
                <td>Same Journal</td>
                <td>14</td>
                <td>27</td>
                <td>17</td>
                <td>8</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As we can see, the agreement is low and the annotators largely fail to recognize the same
author and the same journal. To further investigate the annotators' ranking behaviour, we
performed a visual analysis of the relation between the annotators' rankings, the content
similarity, and the stylometric similarity.</p>
        <p>
          First, we selected a list of stylometric features to extract from the source and the target texts.
The literature suggests a broad range of stylometric features
          <xref ref-type="bibr" rid="ref20">(Mosteller &amp; Wallace, 1964;
Tweedie &amp; Baayen, 2002; Stamatatos, 2009)</xref>
          . Table 2 presents the list of features we extract
for each snippet. In addition, we calculate the minimum, maximum, average and variance of
each of those features for every snippet.
        </p>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption>
            <p>Stylometric features extracted for each snippet.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Feature</th>
                <th>Description</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>alpha-chars-ratio</td><td>the fraction of total characters in the paragraph which are letters</td></tr>
              <tr><td>digit-chars-ratio</td><td>the fraction of total characters in the paragraph which are digits</td></tr>
              <tr><td>upper-chars-ratio</td><td>the fraction of total characters in the paragraph which are upper-case</td></tr>
              <tr><td>white-chars-ratio</td><td>the fraction of total characters in the paragraph which are whitespace characters</td></tr>
              <tr><td>type-token-ratio</td><td>ratio between the size of the vocabulary (i.e., the number of different words) and the total number of words</td></tr>
              <tr><td>hapax-legomena</td><td>the number of words occurring once</td></tr>
              <tr><td>hapax-dislegomena</td><td>the number of words occurring twice</td></tr>
              <tr><td>yules-k</td><td>a vocabulary richness measure defined by Yule</td></tr>
              <tr><td>simpsons-d</td><td>a vocabulary richness measure defined by Simpson</td></tr>
              <tr><td>brunets-w</td><td>a vocabulary richness measure defined by Brunet</td></tr>
              <tr><td>sichels-s</td><td>a vocabulary richness measure defined by Sichel</td></tr>
              <tr><td>honores-h</td><td>a vocabulary richness measure defined by Honore</td></tr>
              <tr><td>average-word-length</td><td>average length of words in characters</td></tr>
              <tr><td>average-sentence-char-length</td><td>average length of sentences in characters</td></tr>
              <tr><td>average-sentence-word-length</td><td>average length of sentences in words</td></tr>
            </tbody>
          </table>
        </table-wrap>
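        <p>Several of the character-level and vocabulary features in Table 2 are straightforward to compute; the sketch below illustrates their definitions (our illustration, not the authors' implementation; the richness measures of Yule, Simpson, Brunet, Sichel and Honore are omitted for brevity):</p>

```python
import re
from collections import Counter

def stylometric_features(text):
    """Compute a subset of the Table 2 features for a non-empty snippet."""
    chars = len(text)
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    freqs = Counter(w.lower() for w in words)
    return {
        "alpha-chars-ratio": sum(c.isalpha() for c in text) / chars,
        "digit-chars-ratio": sum(c.isdigit() for c in text) / chars,
        "upper-chars-ratio": sum(c.isupper() for c in text) / chars,
        "white-chars-ratio": sum(c.isspace() for c in text) / chars,
        "type-token-ratio": len(freqs) / len(words),
        "hapax-legomena": sum(1 for f in freqs.values() if f == 1),
        "hapax-dislegomena": sum(1 for f in freqs.values() if f == 2),
        "average-word-length": sum(len(w) for w in words) / len(words),
        "average-sentence-word-length": len(words) / len(sentences),
    }
```
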
        <p>We consider the similarity between the source and the targets as a cosine similarity between
the stylometric feature vectors. As depicted in Figure 3, we created box-plots to study whether
there is a correlation between the user agreement and the content similarity (a) and one
between the user agreement and the writing style similarity (b).</p>
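        <p>The style similarity between a source and a target is then the cosine between their two stylometric feature vectors; a minimal sketch (the helper name is ours):</p>

```python
import math

def style_similarity(src_features, tgt_features):
    """Cosine similarity between two stylometric feature dicts (Table 2 features)."""
    keys = sorted(set(src_features) | set(tgt_features))
    u = [src_features.get(k, 0.0) for k in keys]
    v = [tgt_features.get(k, 0.0) for k in keys]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```
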
        <p>There is no clear evidence in the considered features that explains the agreement or
disagreement among annotators. To dig more deeply, in Figure 4 we created a scatter plot in
order to see whether there is a correlation between the three similarities, i.e. the content
similarity, the writing style similarity and the inter-rater agreement.
The scatter plot does not provide any visual hint about the annotators'
agreement or disagreement. In addition, we plotted every combination considering each of
the aforementioned features (see Table 2) individually instead of the whole feature vector. Yet,
we did not notice any clear pattern. As these plots add no information beyond the previous
one, we omit them in this paper.</p>
        <p>Finally, we empirically measured whether the annotators did their ranking in a random
manner. We ran 500,000 rounds of random studies and, for each of them, calculated the
inter-rater agreement using Krippendorff’s alpha. The results show an average of 0.250 with a
variance of 0.020. In 28% of the cases the random agreement is larger than in our study; thus
we can conclude with a confidence of 72% that the annotators in our experiment did not rank
in a random manner.</p>
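        <p>The chance baseline can also be cross-checked with a much simpler statistic: with three annotators independently permuting four targets, a target receives a unanimous rank with probability 4·(1/4)³ = 1/16, i.e. about 16.5 of the 264 targets, fewer than the 26 unanimous targets observed. A self-contained Monte-Carlo sketch of this back-of-the-envelope check (our illustration, not the study's 500,000-round alpha simulation):</p>

```python
import random

def expected_unanimous(n_experiments=66, n_annotators=3, rounds=2000, seed=7):
    """Average number of targets (out of n_experiments * 4) receiving the
    same rank from all annotators when everyone ranks at random."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        for _ in range(n_experiments):
            # Each annotator assigns a random permutation of ranks 0..3.
            ranks = [rng.sample(range(4), 4) for _ in range(n_annotators)]
            for target in range(4):
                if len({r[target] for r in ranks}) == 1:
                    total += 1
    return total / rounds
```
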
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we proposed to include an author's writing style as a new dimension of retrieving
scientific literature. In preparation for automating this process, we conducted a pilot study
to learn whether humans can distinguish texts written by the same author when the
topical information is removed. Our analyses show that this is a challenging task, and there
is no clear indicator for the humans' choices. We provide the dataset of this study to the
research community for further investigation. In future work, we also plan to extend the
study by increasing and diversifying the set of experiments, aiming to capture, from human
annotators, properties of their thinking process while performing this task.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The Know-Center is funded within the Austrian COMET Program under the auspices of the
Austrian Ministry of Transport, Innovation and Technology, the Austrian Ministry of
Economics and Labour and by the State of Styria. COMET is managed by the Austrian
Research Promotion Agency FFG.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Medical entity recognition: a comparison of semantic and statistical methods</article-title>
          .
          <source>BioNLP 2011 Workshop</source>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bergsma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Post</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yarowsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Stylometric analysis of scientific articles</article-title>
          .
          <source>Proceedings of Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>F.Y.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Advances in domain independent linear text segmentation</article-title>
          .
          <source>Proceedings of the 1st North American chapter of the Association for Computational Linguistics</source>
          . pp.
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Corney</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buxton</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langdon</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>BioRAT: extracting biological information from full-length papers</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>20</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Dabrowska</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Larsen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Exploiting citation contexts for physics retrieval</article-title>
          .
          <source>3rd International Workshop on Bibliometric-enhanced Information Retrieval (BIR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dias</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Unsupervised topic segmentation based on word co-occurrence and multi-word units for text summarization</article-title>
          .
          <source>Proceedings of the ELECTRA Workshop</source>
          associated to 28th
          <source>ACM SIGIR Conference</source>
          , Salvador, Brazil. pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Eck</surname>
            ,
            <given-names>N.J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Waltman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Systematic Retrieval of Scientific Literature based on Citation Relations: Introducing the CitNetExplorer Tool</article-title>
          . <source>2nd International Workshop on Bibliometric-enhanced Information Retrieval (BIR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Harpalani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Language of vandalism: Improving wikipedia vandalism detection via stylometric analysis</article-title>
          .
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>83</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>TextTiling: Segmenting text into multi-paragraph subtopic passages</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>23</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>33</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>The Evolution of Stylometry in Humanities Scholarship</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>111</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Authorship attribution</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jack</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hristakeva</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Granitzer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>TeamBeam: Meta-data extraction from scientific literature</article-title>
          .
          <source>D-Lib Magazine</source>
          ,
          <volume>18</volume>
          (
          <issue>7</issue>
          ),
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Liakata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batchelor</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Automatic recognition of conceptualization zones in scientific articles and two life science applications</article-title>
          .
          <source>Bioinformatics</source>
          <volume>28</volume>
          (
          <issue>7</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Mayr</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scharnhorst</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larsen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schaer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mutschke</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Bibliometric-enhanced information retrieval</article-title>
          .
          <source>Advances in Information Retrieval</source>
          (pp.
          <fpage>798</fpage>
          -
          <lpage>801</lpage>
          ). Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Mendenhall</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1887</year>
          ).
          <article-title>The characteristic curves of composition</article-title>
          . <source>Science</source>, ns-
          <volume>9</volume>
          (
          <issue>214S</issue>
          ):
          <fpage>237</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Accurate information extraction from research papers using conditional random fields</article-title>
          .
          <source>Proceedings of Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics</source>
          , pp.
          <fpage>329</fpage>
          -
          <lpage>336</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Ravenscroft</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liakata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Clare</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Partridge: An effective system for the automatic classification of the types of academic papers</article-title>
          .
          <source>AI-2013: The 33rd SGAI International Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Rexha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kröll</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Towards authorship attribution for bibliometrics using stylometric features</article-title>
          .
          <source>Proceedings of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey, pp.
          <fpage>44</fpage>
          -
          <lpage>49</lpage>
          . http://ceur-ws.org.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Rexha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kröll</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Towards a more fine grained analysis of scientific authorship: Predicting the number of authors using stylometric features</article-title>
          .
          <source>4th International Workshop on Bibliometric-enhanced Information Retrieval (BIR).</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>60</volume>
          (
          <issue>3</issue>
          ):
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Tsuruoka</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>FACTA: A text search engine for finding associated biomedical concepts</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>24</volume>
          (
          <issue>21</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Tweedie</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Baayen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>How variable may a constant be? Measures of lexical richness in perspective</article-title>
          .
          <source>Computers and the Humanities</source>
          , pp.
          <fpage>323</fpage>
          -
          <lpage>352</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Zweigenbaum</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Frontiers of biomedical text mining: Current progress</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          ,
          <volume>8</volume>
          (
          <issue>5</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>