<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of Term Roles Along Taxonomy Nodes by Adopting Discriminant and Characteristic Capabilities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuliano Armano</string-name>
          <email>armano@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Fanni</string-name>
          <email>francesca.fanni@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Giuliani</string-name>
          <email>alessandro.giuliani@diee.unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical and Electronic Engineering (DIEE) University of Cagliari</institution>
          ,
          <addr-line>via Marengo 2, 09123 Cagliari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Taxonomies are becoming essential to a growing number of applications, particularly in specific domains. Originally built by hand, taxonomies have recently become the focus of automatic generation techniques. In particular, a main issue in automatic taxonomy building is the choice of the most suitable features. In this paper, we propose an analysis of how each feature changes its role along taxonomy nodes in a text categorization scenario, in which the features are the terms occurring in textual documents. We deem that, in a hierarchical structure, each node should intuitively be represented with proper meaningful and discriminant terms (i.e., performing a feature selection task for each node), instead of relying on a fixed feature space. To assess the discriminant power of a term, we adopt two novel metrics able to measure it. Our conjecture is that a term can significantly change its discriminant power (hence, its role) along the taxonomy levels. We perform experiments aimed at proving that a significant number of terms play different roles in each taxonomy node, highlighting the usefulness of a distinct feature selection for each node. We assert that this analysis can support automatic taxonomy building approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Discriminant Capability</kwd>
        <kwd>Characteristic Capability</kwd>
        <kwd>Taxonomy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, the underlying scenario is text categorization, where source items
are textual documents (e.g., webpages, online news, scientific papers, or e-books).
In particular, this work is part of a larger project concerning automatic
taxonomy building. Taxonomies are becoming essential to a growing number
of applications, particularly in specific domains. For example, in web search,
organizing domain-specific queries into hierarchies can help to better understand
the queries and improve search results [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or to improve query refinement [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Originally built by hand, taxonomies have recently become the focus of
automatic generation techniques. In particular, a main issue in automatic
taxonomy building is the choice of the most suitable features (i.e., the terms in
the textual documents). We deem that, in a hierarchical structure, each node
should intuitively be represented with proper discriminant terms, i.e., performing
a feature selection task for each node, instead of considering a fixed feature space
for the entire taxonomy. To assess the discriminant power of a term, we use
novel metrics able to measure it. The adopted metrics are the discriminant
capability, which grows with the ability to distinguish a given category from the
others, and the characteristic capability, which grows with how frequent and
common the term is over all categories. Our conjecture is that a term can change
its role, depending on its discriminant power, along the taxonomy levels. We
perform experiments aimed at analyzing such changes of role along taxonomy nodes.
      </p>
      <p>The rest of the paper is organized as follows: Section 2 covers the background
of this work; Section 3 describes the adopted metrics, whereas Section 4
explains the methodology of the term role analysis; experiments are reported
in Section 5, and Section 6 ends the paper with conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Recently there has been a focus on automatic taxonomy building. The
motivations are obvious: manual construction is a laborious process, and the resulting
taxonomy is often highly subjective compared with taxonomies built by
data-driven approaches. Furthermore, automatic approaches could potentially enable
humans, or even machines, to understand highly focused and fast-changing
domains. Several works have been devoted to taxonomy induction, in particular
with respect to automatically creating a domain-specific ontology or taxonomy
[7-9]. In particular, an important task is to recognize the most meaningful
features. According to Luhn [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], only a relatively small number of terms in a
document is meaningful. In fact, most terms in a corpus are non-informative.
There are two types of terms that are meaningless for representing a topic or a
category: (i) terms that occur only in a few documents, and (ii) terms
that frequently occur in a document collection (the so-called stopwords);
stopwords are mainly pronouns, articles, prepositions, conjunctions, some frequent
verb forms, etc. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Several works in the literature focused on the analysis
of stopwords in document collections [3-5, 10], proving that stopwords tend to
occur in the majority of domain documents and introduce noise in IR tasks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
For these reasons, stopwords should be filtered out in the document representation
process, since they actually reduce retrieval effectiveness. Our insight is that, due
to the hierarchical structure, each node should intuitively be represented with
proper meaningful terms, instead of considering a fixed vocabulary for the
entire structure. In other words, we deem that each document collection is unique,
making it useful to devise methods and algorithms able to automatically build a
distinct list of meaningful features for each collection.
      </p>
    </sec>
    <sec id="sec-3">
      <title>The Adopted Metrics</title>
      <p>
        In this paper, we adopt two metrics able to provide relevant information to
researchers in several IR and ML tasks. The metrics have been devised for both
classifier performance assessment and feature selection tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As the most
acknowledged approaches do not assess the discriminant power of a term, we
apply the metrics to feature selection, where they are able to evaluate the
discriminant and characteristic capabilities of each feature. In particular, as the
underlying scenario is text categorization, for each term the former measures
the ability to distinguish a given category C from the others, whereas the latter
measures to which extent the term is pervasive in the given set of documents. The
definitions of the discriminant (δ) and characteristic (φ) capabilities, in this scenario,
are the following [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]:
      </p>
      <p>δ = #(t; C)/#(C) − #(t; C̄)/#(C̄)   (1)</p>
      <p>φ = #(t; C)/#(C) + #(t; C̄)/#(C̄) − 1   (2)</p>
      <p>where a generic term t contained in a document represents the binary feature
under analysis, meaning that it can assume two values, depending on its
presence or absence in the document. Table 1 reports the meaning of each
component in the formulas, in which the absence of the term is denoted as t̄, and the
alternate class is denoted as C̄.</p>
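      <p>As an illustration, the two capabilities can be computed directly from document counts. The following sketch is ours, not part of the paper; the function and variable names are hypothetical, and the counts correspond to the components of Equations 1 and 2:</p>
      <preformat>
```python
# Sketch: computing the discriminant (delta) and characteristic (phi)
# capabilities of a binary term feature from document counts,
# following Equations (1) and (2). Helper names are our own.

def capabilities(docs_with_t_in_C, docs_in_C, docs_with_t_in_altC, docs_in_altC):
    """Return (delta, phi) for a term t with respect to category C."""
    in_C = docs_with_t_in_C / docs_in_C          # #(t; C) / #(C)
    in_altC = docs_with_t_in_altC / docs_in_altC # #(t; C-bar) / #(C-bar)
    delta = in_C - in_altC                       # discriminant capability, in [-1, +1]
    phi = in_C + in_altC - 1.0                   # characteristic capability, in [-1, +1]
    return delta, phi

# A term occurring in 90% of C's documents and 10% of the alternate
# class's documents is strongly positively discriminant:
d, p = capabilities(90, 100, 10, 100)   # d close to 0.8, p close to 0.0
```
      </preformat>
      <p>A term present in every document of both C and C̄ yields φ = +1 and δ = 0, matching the right-hand (stopword) corner of the rhombus.</p>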
      <p>
        Assuming both ranging from -1 to +1, the proposed metrics show an
orthogonal behavior, and it has been proved that the ' space is constrained by a
rhomboidal shape [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as reported in Figure 1.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Terms Roles</title>
      <p>In this context, a term plays a distinct role in each category, depending on the
rhombus region in which the term falls. Important terms for text classification
appear in the upper and lower corners of the rhombus in Figure 1, as they have
high values of |δ|. In particular, a high positive value of δ means that the term
frequently occurs in C and is rare in C̄; ideally, δ is +1 when the term occurs
in all documents of C and no documents of C̄ contain it. Conversely, a high
negative value of δ means that the term frequently occurs in C̄ and is rare in
C; ideally, δ = −1 means that all documents of C̄ contain the term, and no
documents of C contain it. As for the characteristic capability, terms that barely occur
in the entire domain are expected to appear in the left corner of the
rhombus (high negative values of φ), while stopwords are expected to appear in
the right-hand corner (high positive values of φ). Ideally, φ = +1 when the
term occurs in each document of the entire domain, whereas φ = −1 when the
term is completely absent from the domain. Figure 2 outlines the expected behavior
for all cases.</p>
      <p>
        Terms falling in the right-hand corner do not necessarily represent typical
stopwords only (i.e., common articles, nouns, conjunctions, verbs, and adverbs).
Rather, domain-dependent stopwords are also located in that area [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The aim of this work is to analyze the roles of terms at different levels of
taxonomies; in particular, we are interested in verifying how terms change their
role in the ancestor nodes. For example, let us consider the domain "Sport",
containing the categories "Volleyball", "Basket", and "Football"; intuitively, for the
given domain, the term ball should be considered a domain-dependent stopword,
as it is not relevant to discriminate among the cited categories. The term could
change its role in the parent node. For example, if the parent node belongs to a
set of siblings built with the categories "Sport", "Music", and "Economy", the
term ball intuitively becomes discriminant for the category "Sport", as reported
in Figure 3.</p>
      <p>In so doing, we perform a further analysis of the rhombus area defined in the
δ-φ space. We first introduce the behavior of the metrics in the neighborhood
of the origin. Theoretically, if a term has a zero value for both δ and φ in a given
category C, it is equally distributed in the domain in this way: half of the documents
of C contain the term, and half of the documents of the alternate category C̄
also contain it. If a term is projected close to the origin of the space, there is
uncertainty in considering the term as a stopword, irrelevant, or discriminant. An
analysis of this region is part of future work.</p>
      <p>We assign the following symbols to the regions of the rhombus:
– δ+: the region in which highly positive discriminant terms are placed;
– δ−: the region in which highly negative discriminant terms are placed;
– φ+: the region in which global and domain-dependent stopwords are placed;
– φ−: the region in which rare terms fall;
– O: the region of uncertainty, where we cannot actually infer the real nature
of a term.</p>
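      <p>The region assignment above can be sketched as follows. This is our own illustration, not the paper's implementation; in particular, the paper leaves the exact shape and size of the O region open, so the square neighborhood of the origin and the threshold <italic>eps</italic> used here are purely assumptions:</p>
      <preformat>
```python
# Sketch (assumed thresholds): mapping a (delta, phi) pair to a rhombus region.
# Labels: "d+"/"d-" stand for the delta+/delta- regions, "phi+"/"phi-" for the
# phi+/phi- regions, "O" for the uncertainty region around the origin.

def region(delta, phi, eps=0.1):
    # Assumption: O is modeled as a small square around the origin;
    # the proper shape of this region is an open question in the paper.
    if abs(delta) <= eps and abs(phi) <= eps:
        return "O"
    # Outside O, the dominant coordinate decides the corner of the rhombus.
    if abs(delta) >= abs(phi):
        return "d+" if delta > 0 else "d-"    # discriminant corners
    return "phi+" if phi > 0 else "phi-"      # characteristic corners

region(0.02, -0.03)   # uncertainty region
region(0.7, 0.1)      # positively discriminant term
region(-0.1, 0.8)     # stopword-like term
```
      </preformat>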
      <p>With the goal of better understanding the roles of terms along the taxonomy
levels, we adopt a finite-state machine (FSM) representation, in which each
region defined above represents a "state", whereas each transition represents a
change of region: the "current state" is the region in which a term falls for a
given node, and the "next state" is the region of the parent-node rhombus in
which the term is placed. For instance, let us consider the example of Figure 3:
the term ball falls in the φ+ region for the node "Basketball"; in the parent
node "Sport" the term becomes discriminant and falls in the δ+ region. The
associated FSM example is reported in Figure 6.</p>
      <p>We perform experiments aimed at analyzing the term roles along taxonomy
levels. In so doing, we want to verify that terms have different roles along taxonomy
nodes. Experiments are performed using a collection of webpage documents. The dataset
is extracted from the DMOZ taxonomy (http://www.dmoz.org), a collection of HTML
documents referenced in a Web directory developed in the Open Directory Project
(ODP). A set of 174 categories containing about 20000 documents, organized in 36
domains, has been chosen. Each domain consists of a set of sibling nodes. Aside
from the leaves, each node is built with the union of its children's documents.
Textual information is extracted from each page, and each document is converted
into a bag-of-words representation, each word being weighted with two values, φ and
δ, computed by applying Equations 1 and 2.</p>
      <p>The purpose of the following experiments is to track term movements, i.e., for
each term, the change of region (a transition in the FSM model) when the focus
is moved from a node to its parent. We discarded the global stopwords in this
analysis, since we want to focus on domain-dependent and discriminant terms.
In the following FSM charts, each edge is marked with the number of terms that
participated in the associated transition. Figure 7 reports the transitions for all
terms in the dataset, in which the value of the parameter controlling the size of
the neutral region is initially set to 0.1. We performed analyses with other values
of this parameter; we did not report the charts for the sake of brevity. The choice
of the proper shape and dimensions of the O region is currently under study.</p>
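      <p>The movement-tracking procedure can be sketched as follows. The data layout is hypothetical (ours, not the paper's): each node maps its terms to a region symbol, and a separate mapping links each node to its parent; transitions are then counted per (current region, parent region) pair, as on the FSM edges:</p>
      <preformat>
```python
# Sketch (hypothetical data layout): counting term "movements", i.e. region
# transitions between each node and its parent, as in the FSM charts.
from collections import Counter

def count_transitions(placements, parent):
    """placements: {node: {term: region_symbol}}; parent: {node: parent_node}."""
    transitions = Counter()
    for node, terms in placements.items():
        if node not in parent:
            continue                      # root nodes have no parent to compare with
        parent_terms = placements[parent[node]]
        for term, reg in terms.items():
            if term in parent_terms:
                # one edge of the FSM: (current state, next state)
                transitions[(reg, parent_terms[term])] += 1
    return transitions

# Toy example mirroring Figure 3: "ball" is stopword-like in "Basketball"
# and becomes discriminant in the parent node "Sport".
placements = {
    "Basketball": {"ball": "phi+", "dunk": "d+"},
    "Sport":      {"ball": "d+",   "dunk": "phi-"},
}
count_transitions(placements, {"Basketball": "Sport"})
```
      </preformat>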
      <p>In our opinion, the transitions involving only a few terms are likely due
to statistical fluctuations. For the sake of clarity, we did not report
transitions marked with 1. The majority of movements, as expected, belongs to
the transition φ− → φ−, meaning that most terms (more than 99% of them)
are rare or irrelevant (in accordance with Zipf's law). This is intuitively
coherent with the fact that, in a parent node, a term belongs to a more populated
vocabulary; if a term is rare in a domain, it should remain rare (actually, it should
be even rarer) in a bigger domain.</p>
      <p>As for the other transitions, a term having a high value of φ may become
discriminant, as expected, in the parent node. This corresponds to the transition
φ+ → δ+. Figure 7 reports 37 φ+ → δ+ transitions. This phenomenon sustains
the conjecture that a domain-dependent stopword may become discriminant in
the parent node. There is also a significant number of δ+ → φ− transitions,
meaning that a discriminant term (positive side) tends to become irrelevant in the
parent node. As the parent node is built with the union of the documents belonging
to its children, it has a larger population of terms. The chosen taxonomy has
a high average branching factor (≈ 5); hence, the frequency of a term is
significantly smaller in the parent node, and the term becomes rare. The same behavior
is observed in the δ− → φ− transition for the same reasons (the smaller
number of transitions is due to the smaller number of negatively discriminant terms
than positive ones in the entire dataset).</p>
      <p>Moreover, in the previous FSM charts, there is a significant number of δ− →
δ+ transitions: although at first glance it seems a strange behavior, it is actually
not surprising. Let us consider the term "ball" in the children of Sport, as reported
in the example of Figure 8; most of them (Volley, Basket, Football, Rugby, and
Handball) are sports played with a ball. On the other hand, there is a sibling
(Auto Racing) in which the term "ball" is expected to occur barely. The term
is obviously negatively discriminant for Auto Racing, as it appears frequently in
the alternate class; on the other hand, the term should be significantly frequent
in the domain Sport, as it is expected to appear frequently in 5 siblings out of 6;
furthermore, looking at the siblings of Sport, intuitively the term should not be
frequent in the alternate class, given by the union of Music and Economy. Hence,
the term "ball" should be positively discriminant for Sport.</p>
      <p>A further important property is that there are no φ+ → δ− transitions; this
is an expected behavior: a negatively discriminant term in the parent node (δ−
region) has a very low frequency (0 occurrences in the ideal case); this is in contrast
with the fact that a positive characteristic term (falling in the φ+ region) is
highly frequent in the entire domain.</p>
      <p>Taking into account the previous analyses, we can now hypothesize that
a domain-dependent stopword in a given node probably becomes discriminant
when the focus is moved to upper levels of the taxonomy; subsequently it becomes
rare, and remains rare up to the root of the taxonomy. Figure 9 reports the
associated FSM.</p>
      <p>As an example, let us consider the path highlighted in Figure 10; the node
Academic Department contains several domain-dependent stopwords (that is,
they belong to the φ+ region).</p>
      <p>We expect that at least some of them become discriminant in the parent
node (Computer Science); moving up the taxonomy path, we suppose they
become irrelevant (they fall in the φ− region for the node Computers). Figure 11
confirms this aspect, reporting the experimental results for the
previous example; the node Academic Department contains the domain-dependent
stopwords "computer", "research", "science", and "university". The Figure shows
the placements of these terms in the δ-φ space, for the given node and for
the ancestors highlighted in Figure 10; it is clear how each term undergoes the overall
transition φ+ → δ+ → φ−.</p>
      <p>This behavior is an essential property of a hierarchically ordered set of
documents. It is well known that in classification tasks a feature selection process
improves classifier performance; in a taxonomy, a feature (i.e., a term) may
assume different roles in each node. Figure 11 clearly shows that a distinct
feature selection task should be performed for each node, instead of considering a
global feature space for the entire taxonomy; the methodology will be based on
the selection of the most discriminant terms only, discarding irrelevant terms
and stopwords (both global and domain-dependent). The metrics permit identifying
meaningful features to be selected in automatic taxonomy generation algorithms.
Furthermore, this conjecture is the starting point of future work on hierarchical
classification, in which the metrics will be adopted for performing local feature
selection tasks.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>We proposed an analysis of how each feature changes its role along taxonomy
nodes in a text categorization scenario, in which the features are the terms in
the textual documents. Results proved that a significant number of terms have
different roles along taxonomy nodes, highlighting the usefulness of a
proper feature selection for each node. The adopted metrics permit identifying
meaningful features to be selected in automatic taxonomy generation algorithms.
Currently, we are developing a methodology based on these metrics. Furthermore,
this conjecture is the starting point of future work on hierarchical classification,
in which the metrics will be adopted for performing local feature selection tasks.
We are also planning to investigate different sizes of the
uncertainty region.</p>
      <p>Acknowledgments. This work has been supported by LR7 2009 - Investment
funds for basic research (funded by the local government of Sardinia).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Armano</surname>
          </string-name>
          , G.:
          <article-title>A direct measure of discriminant and characteristic capability for classifier building and assessment</article-title>
          .
          <source>Tech. rep., DIEE</source>
          , Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy (
          <year>2014</year>
          ),
          <source>DIEE Technical Report Series</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Armano</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fanni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giuliani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Stopwords identification by means of characteristic and discriminant analysis</article-title>
          . In: Loiseau, S., Filipe, J., Duval, B., Van Den Herik, J. (eds.) 7th
          <source>International Conference on Agents and Artificial Intelligence</source>
          <year>2015</year>
          (ICAART
          <year>2015</year>
          ). pp.
          <volume>353</volume>
          –
          <fpage>360</fpage>
          .
          SCITEPRESS Science and Technology Publications, Lisbon, Portugal (10–12
          <year>Jan 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Francis</surname>
            ,
            <given-names>W.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kucera</surname>
          </string-name>
          , H.:
          <article-title>Frequency Analysis of English Usage: Lexicon and Grammar</article-title>
          . Houghton Mifflin (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          :
          <article-title>To decode short cryptograms</article-title>
          .
          <source>Commun. ACM</source>
          <volume>37</volume>
          (
          <issue>9</issue>
          ),
          <fpage>102</fpage>
          –108 (Sep
          <year>1994</year>
          ), http://doi.acm.org/10.1145/182987.184078
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kucera</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Francis</surname>
            ,
            <given-names>W.N.</given-names>
          </string-name>
          :
          <article-title>Computational analysis of present-day American English</article-title>
          . Brown University Press, Providence, RI (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Luhn</surname>
            ,
            <given-names>H.P.:</given-names>
          </string-name>
          <article-title>The automatic creation of literature abstracts</article-title>
          .
          <source>IBM J. Res. Dev</source>
          .
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>159</fpage>
          –165 (Apr
          <year>1958</year>
          ), http://dx.doi.org/10.1147/rd.22.0159
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Automatically inducing ontologies from corpora</article-title>
          .
          <source>In: Proceedings of CompuTerm</source>
          <year>2004</year>
          : 3rd International Workshop on Computational Terminology, COLING'
          <year>2004</year>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Navigli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velardi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faralli</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A graph-based algorithm for inducing lexical taxonomies from scratch</article-title>
          .
          <source>In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three</source>
          . pp.
          <fpage>1872</fpage>
          –
          <lpage>1877</lpage>
          . IJCAI'11, AAAI Press (
          <year>2011</year>
          ), http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-313
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Poon</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Unsupervised ontology induction from text</article-title>
          .
          <source>In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>296</fpage>
          –
          <lpage>305</lpage>
          . ACL '10, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2010</year>
          ), http://dl.acm.org/citation.cfm?id=1858681.1858712
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Rijsbergen</surname>
            ,
            <given-names>C.J.V.</given-names>
          </string-name>
          :
          <source>Information Retrieval</source>
          .
          Butterworth-Heinemann, Newton, MA, USA, 2nd edn. (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sadikov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Madhavan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Clustering query refinements by user intent</article-title>
          .
          <source>In: Proceedings of the 19th international conference on World wide web</source>
          . pp.
          <fpage>841</fpage>
          –
          <lpage>850</lpage>
          . WWW '10,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2010</year>
          ), http://doi.acm.org/10.1145/1772690.1772776
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The importance of stop word removal on recall values in text categorization</article-title>
          .
          <source>In: International Joint Conference on Neural Networks</source>
          ,
          <year>2003</year>
          . vol.
          <volume>3</volume>
          , pp.
          <fpage>1661</fpage>
          –
          <lpage>1666</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>White</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennett</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
          </string-name>
          , S.T.:
          <article-title>Predicting short-term interests using activity-based search context</article-title>
          .
          <source>In: Proceedings of the 19th ACM international conference on Information and knowledge management</source>
          . pp.
          <fpage>1009</fpage>
          –
          <lpage>1018</lpage>
          . CIKM '10,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2010</year>
          ), http://doi.acm.org/10.1145/1871437.1871565
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>