<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Padua at CLEF 2002: Experiments to evaluate a statistical stemming algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michela Bacchin</string-name>
          <email>michela.bacchin@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>nicola.ferro@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Melucci</string-name>
          <email>massimo.melucci@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering - University of Padua Via Gradenigo 6/a - 35131 Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <abstract>
        <p>In Information Retrieval (IR), stemming is used to reduce variant word forms to a common root. The assumption is that if two words have the same root, then they represent the same concept. Hence stemming permits an IR system to match query and document terms that share a meaning but appear in different morphological variants. In this paper we report on our participation in the CLEF 2002 Italian monolingual task, whose aim was to evaluate a statistical stemming algorithm based on link analysis. Considering that a word is formed by a prefix (stem) and a suffix, the key idea is that the interlinked prefixes and suffixes form a community of substrings. Hence discovering these communities means searching for the best word splits, which give the best word stems. The results show that stemming improves IR effectiveness. They also show that the effectiveness level of our algorithm, which incorporates neither heuristics nor linguistic knowledge, is comparable to that of an algorithm based on a-priori linguistic knowledge. This is an encouraging result, particularly in a multilingual context.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Stemming</title>
      <p>Stemming is used to reduce variant word forms to a common root. The assumption is that if two words
have the same root, then they represent the same concept. Hence stemming permits an IR system to
match query and document terms that share a meaning but appear in different
morphological variants.</p>
      <p>
        The effectiveness of stemming is a debated issue, and different studies have reached different conclusions. If
effectiveness is measured by the traditional precision and recall measures, it seems that for a language with
a relatively simple morphology, like English, stemming has little influence on overall performance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In
contrast, stemming can significantly increase retrieval effectiveness [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and can also increase precision
for short queries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], for languages with a more complex morphology, like the Romance languages. Finally,
since system performance must reflect users' expectations, it has to be considered that the use of a
stemmer is intuitive to many users [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], who can express a query to the system using a specific word
without keeping in mind that only a variant of this word may appear in a relevant document. Hence,
stemming can also be viewed as a feature of the user-interaction interface of an IR service.
      </p>
      <p>To design a stemming algorithm, it is possible to follow a linguistic approach, using prior knowledge of
the morphology of the specific language, or a statistical approach, using methods based on statistical
principles to infer the word formation rules of the language from the corpus of documents. The
former implies manual labor by experts in linguistics: the word formation rules have to be
formalized, which is hard work, especially for
languages whose morphology is complex. Stemming algorithms based on statistical methods ensure no
costs for adding new languages to the system, an advantage that becomes crucial
for multilingual IR systems.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodological Approach</title>
      <p>
        We consider a special case of stemming, which belongs to the category known as affix removal
stemming [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In particular, our approach follows the suffix stripping paradigm, which is adopted by most
stemmers currently in use in IR, like those reported in [
        <xref ref-type="bibr" rid="ref12 ref15 ref9">9, 12, 15</xref>
        ]. This stemming process splits each word
into two parts, prefix and suffix, and takes as the stem the substring corresponding to the obtained
prefix. Let us consider a finite collection of unique words W = {w1, ..., wN} and a word w ∈ W of length
|w|; then w can be written as w = xy, where x is a prefix and y is a suffix. If we split each word w into
all the |w| − 1 possible pairs of substrings, we build a collection of substrings, and each substring may
be a prefix, a suffix or both of at least one element w ∈ W. Let X be the set of the prefixes of the
collection and S ⊆ X be the set of the stems. We are interested in detecting the prefix x that is the most
probable stem for the observed word w. Hence, we have to determine the prefix x∗ such that:
        <disp-formula><label>(1)</label><tex-math>x^{*} = \arg\max_{x} \Pr(x \in S \mid w \in W)</tex-math></disp-formula>
        <disp-formula><label>(2)</label><tex-math>x^{*} = \arg\max_{x} \frac{\Pr(w \in W \mid x \in S)\,\Pr(x \in S)}{\Pr(w \in W)}</tex-math></disp-formula>
where (2) is obtained from (1) by applying Bayes’ theorem, which lets us swap the order of dependence between
events. We can ignore the denominator, which is the same for all splits of w. Pr(w ∈ W | x ∈ S)
is the probability of observing w given that the stem x has been observed. A reasonable estimate of
that probability would be the reciprocal of the number of words beginning with that stem, if the stems
were known. However, the stems are unknown (indeed, stem detection is the target of this
method), so the number of words beginning with a stem cannot be computed. Therefore we estimated
that probability by the reciprocal of the number of words beginning with that prefix. As regards Pr(x ∈ S),
we estimated this probability using an algorithm that discloses the mutual relationship between stems
and derivations in forming the words of the collection.</p>
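      <p>The estimate of Pr(w ∈ W | x ∈ S) described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names splits, words_beginning_with and likelihood are hypothetical and do not come from the SPLIT tools, and the word list is the toy set used later in this section.</p>
      <preformat>
```python
from collections import Counter


def splits(word):
    """Return all len(word) - 1 (prefix, suffix) pairs with non-empty parts."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def words_beginning_with(words):
    """Map each prefix (whole words included) to the number of unique words beginning with it."""
    counts = Counter()
    for w in set(words):
        for i in range(1, len(w) + 1):
            counts[w[:i]] += 1
    return counts


def likelihood(prefix, counts):
    """Estimate Pr(w in W | x in S) as 1 / (number of words beginning with x)."""
    return 1.0 / counts[prefix]


W = ["aba", "abb", "baa"]  # toy word set, used only for illustration
counts = words_beginning_with(W)
for word in W:
    for prefix, suffix in splits(word):
        print(word, prefix, suffix, counts[prefix], likelihood(prefix, counts))
```
      </preformat>
      <p>Counting whole words as prefixes of themselves is a choice of this sketch, made so that a word that is also a proper prefix of another word is counted among the words beginning with it.</p>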
      <p>The rationale of using mutual reinforcement is based on the idea that stems extracted from W are
those substrings that:
– are very frequent, and
– form words together with very frequent suffixes.</p>
      <p>This means that very frequent prefixes are candidates to be stems, but they are discarded if they are
not followed by very frequent suffixes; for example, all initials are very frequent prefixes but they are
unlikely stems because the corresponding suffixes are rather rare, if not unique. The same holds for
suffixes corresponding to ending vowels or consonants. Thus, there are prefixes that are less frequent than
initials but are followed by suffixes that are frequent, yet less frequent than ending characters: these suffixes and
prefixes correspond to candidate correct word splits and we label them as “good”. The key idea is that
interlinked good prefixes and suffixes form a community of substrings whose links correspond to words,
i.e. to splits. Discovering these communities is like searching for the best splits.</p>
      <p>
        To compute the best split, we used the well-known HITS algorithm reported in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
often discussed in research papers as a paradigmatic algorithm for Web page retrieval. It considers
a mutually reinforcing relationship among good authorities and good hubs, where an authority is a web
page pointed to by many hubs and a hub is a web page which points to many authorities. The parallel
with our context becomes clear when we associate the concept of a hub with a prefix and that of an authority
with a suffix. The method belongs to the larger class of approaches that use frequencies of substrings to
decide the goodness of prefixes and suffixes, often used in statistical morphological analysis [
        <xref ref-type="bibr" rid="ref11 ref4">11, 4</xref>
        ], and in
the pioneering work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The contribution of this paper is the use of the notion of mutual reinforcement, applied to
prefix frequencies and suffix frequencies, to compute the best word splits, which give the best word stems,
as explained in the following.
      </p>
      <p>Using a graphical notation, the set of prefixes and suffixes can be written as a graph g such that nodes
are substrings and an edge occurs between nodes x, y if w = xy is a word in W. By definition of g, no
vertex is isolated. As an example, let us consider the following toy set of words: W = {aba, abb, baa};
splitting these into all the possible prefixes and suffixes produces a graph, reported in Figure 3a,
whose nodes are the substrings a, aa, ab, b, ba and bb.</p>
      <p>Let us define P(y) = {x : ∃w, w = xy} and S(x) = {y : ∃w, w = xy}, which are, respectively, the set
of all prefixes of a given suffix y and the set of all suffixes of a given prefix x. If p_x and s_y indicate,
respectively, the prefix score and the suffix score, the criteria can be expressed as:
        <disp-formula><label>(3)</label><tex-math>p_x = \sum_{y \in S(x)} s_y, \qquad s_y = \sum_{x \in P(y)} p_x</tex-math></disp-formula>
under the assumption that scores are expressed as sums of scores and splits are equally weighted.</p>
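      <p>As an illustration of the sets P(y) and S(x) and of one application of the two sums in (3), the following sketch builds the graph for the toy set W = {aba, abb, baa}. The helper names build_graph, suffix_update and prefix_update are assumptions of this sketch, not names used by the authors, and starting from scores equal to 1 is simply a convenient choice.</p>
      <preformat>
```python
from collections import defaultdict


def build_graph(words):
    """Return (S, P): S[x] is the set of suffixes of prefix x, P[y] the set of prefixes of suffix y."""
    S = defaultdict(set)
    P = defaultdict(set)
    for w in set(words):
        for i in range(1, len(w)):
            x, y = w[:i], w[i:]
            S[x].add(y)
            P[y].add(x)
    return S, P


def suffix_update(P, p):
    """s_y = sum of p_x over all x in P(y)."""
    return {y: sum(p[x] for x in P[y]) for y in P}


def prefix_update(S, s):
    """p_x = sum of s_y over all y in S(x)."""
    return {x: sum(s[y] for y in S[x]) for x in S}


S, P = build_graph(["aba", "abb", "baa"])
ones = defaultdict(lambda: 1.0)  # every substring starts with score 1
s1 = suffix_update(P, ones)
p1 = prefix_update(S, s1)
print(sorted(p1.items()))
```
      </preformat>
      <p>Each word contributes one edge per split, so a prefix gains score from every suffix it forms a word with, and vice versa.</p>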
      <p>The method of mutual reinforcement has been formalized through the HITS iterative algorithm. Here
we map HITS to our study context, as follows:</p>
      <preformat>Compute suffix scores and prefix scores from W
  V: the set of substrings extracted from all the words in W
  P(y): the set of all prefixes of a given suffix y
  S(x): the set of all suffixes of a given prefix x
  N: the number of all substrings in V
  n: the number of iterations
  1: the vector (1, ..., 1) ∈ R^|V|
  0: the vector (0, ..., 0) ∈ R^|V|
  s(k): suffix score vector at step k
  p(k): prefix score vector at step k

  s(0) = 1
  p(0) = 1
  for each iteration k = 1, ..., n
    s(k) = 0
    p(k) = 0
    for each y ∈ V
      s_y(k) = Σ_{x ∈ P(y)} p_x(k−1)
    for each x ∈ V
      p_x(k) = Σ_{y ∈ S(x)} s_y(k)
    normalize p(k) and s(k) so that 1 = Σ_x p_x(k) = Σ_y s_y(k)
  end.</preformat>
      <p>
        Using matrix notation, the graph g can be described with a |V| × |V| matrix M such that
        <disp-formula><tex-math>m_{ij} = 1 \text{ if prefix } i \text{ and suffix } j \text{ form a word, and } m_{ij} = 0 \text{ otherwise.}</tex-math></disp-formula>
As explained in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the algorithm computes two matrices: A = MᵀM and B = MMᵀ, where the generic
element a_ij of A is the number of vertices that point to both i and j, whereas the generic element
b_ij of B is the number of vertices that are pointed to by both i and j. The n-step iteration of the algorithm
corresponds to computing Aⁿ and Bⁿ. In the same paper, it has been argued that s = [s_y] and p = [p_x]
converge to the principal eigenvectors of A and B, respectively. The scores computed for the toy set of words are
reported in Table 3b.
      </p>
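      <p>The iterative procedure above can be sketched as follows. This is a minimal pure-Python illustration, not the SPLIT implementation: the function name hits_scores and the edge-list representation are assumptions made here, and the example runs on the toy set of words rather than on a real vocabulary.</p>
      <preformat>
```python
def hits_scores(words, n_iter=100):
    """Return (p, s): prefix and suffix scores after n_iter iterations."""
    # One edge (prefix, suffix) per possible split of each unique word.
    edges = {(w[:i], w[i:]) for w in set(words) for i in range(1, len(w))}
    nodes = {x for x, _ in edges} | {y for _, y in edges}
    p = dict.fromkeys(nodes, 1.0)
    s = dict.fromkeys(nodes, 1.0)
    for _ in range(n_iter):
        new_s = dict.fromkeys(nodes, 0.0)
        for x, y in edges:
            new_s[y] += p[x]  # s_y(k) = sum over x in P(y) of p_x(k-1)
        new_p = dict.fromkeys(nodes, 0.0)
        for x, y in edges:
            new_p[x] += new_s[y]  # p_x(k) = sum over y in S(x) of s_y(k)
        p_total = sum(new_p.values())
        s_total = sum(new_s.values())
        # Normalize so that each score vector sums to 1.
        p = {v: score / p_total for v, score in new_p.items()}
        s = {v: score / s_total for v, score in new_s.items()}
    return p, s


p, s = hits_scores(["aba", "abb", "baa"], n_iter=1)
for node in sorted(p):
    print(node, round(p[node], 3), round(s[node], 3))
```
      </preformat>
      <p>In matrix terms, each pass of the loop multiplies the prefix score vector by B = MMᵀ and then rescales it, which is why repeated passes approach a principal eigenvector of B.</p>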
      <p>
        As explained previously, we argue that the probability that x is a stem can be estimated with the
prefix score p_x just calculated. The underlying assumption is that the scores can be seen as probabilities
and, in effect, it has been proved in a recent work that HITS scores can be considered as the stationary
distribution of a random walk [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular, the authors proved the existence of a Markov chain
whose stationary distribution is equal to the hub vector after the nth iteration of Kleinberg’s
algorithm, which is, in our context, the prefix score vector p = [p_x]. The generic element q_ij^(n) of the
transition matrix of the chain is the probability that, starting from i, one reaches j after n steps of
“bouncing” to one of the suffixes associated with both i and j. To interpret the result in a
linguistic framework, p_i can be seen as the probability that i is judged as a stem by the community
of substrings (suffixes) resulting from the process of splitting the words of a language. In Table 1, all the
possible splits for all the words are reported and measured using the estimated probability.
      </p>
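      <p>Putting the two estimates together, the stem of a word can be chosen as the prefix that maximizes the prefix score divided by the number of words beginning with that prefix. The sketch below is illustrative only: best_stem is a hypothetical name, the prefix scores are those obtained by hand after a single iteration of the update rules on the toy set W = {aba, abb, baa} (not the converged scores), and returning the whole word when no split exists is an assumption of this sketch.</p>
      <preformat>
```python
def best_stem(word, prefix_score, n_words_with_prefix):
    """Return the prefix x of word that maximizes prefix_score[x] / n_words_with_prefix[x]."""
    candidates = [word[:i] for i in range(1, len(word))]
    if not candidates:
        # Assumption of this sketch: a one-letter word is its own stem.
        return word
    return max(
        candidates,
        key=lambda x: prefix_score.get(x, 0.0) / n_words_with_prefix[x],
    )


# Prefix scores after one iteration on the toy set, worked out by hand.
prefix_score = {"a": 0.25, "ab": 0.375, "b": 0.125, "ba": 0.25}
# Number of toy words beginning with each prefix.
n_words_with_prefix = {"a": 2, "ab": 2, "b": 1, "ba": 1}

for w in ["aba", "abb", "baa"]:
    print(w, best_stem(w, prefix_score, n_words_with_prefix))
```
      </preformat>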
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>word</th><th>prefix</th><th>suffix</th><th>words beginning with prefix</th></tr>
          </thead>
          <tbody>
            <tr><td>baa</td><td>b</td><td>aa</td><td>1</td></tr>
            <tr><td>baa</td><td>ba</td><td>a</td><td>1</td></tr>
            <tr><td>aba</td><td>a</td><td>ba</td><td>2</td></tr>
            <tr><td>aba</td><td>ab</td><td>a</td><td>2</td></tr>
            <tr><td>abb</td><td>a</td><td>bb</td><td>2</td></tr>
            <tr><td>abb</td><td>ab</td><td>b</td><td>2</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        The aim of the CLEF 2002 experiments was to compare the retrieval effectiveness of the link analysis-based
algorithm illustrated in the previous Section with that of an algorithm based on a-priori linguistic
knowledge, because the hypothesis is that a language-independent algorithm, such as the one we propose, might
effectively replace one developed on the basis of manually coded derivational rules. Before comparing
the algorithms, we assessed the impact of both stemming algorithms by comparing their effectiveness
with that reached without any stemmer. In fact, we wanted to test whether system performance is not
significantly hurt by the application of stemming, as hypothesized in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. If, on the contrary, stemming
improved effectiveness, and the effectiveness of the tested algorithms were comparable, the link-based
algorithm would ensure low costs for extending it to other languages, which is crucial in multilingual
settings. To evaluate stemming, we decided to compare the performance of an IR system changing only
the stemming algorithm across runs, all other things being equal.
      </p>
      <p>
        For indexing and retrieval, we used an experimental IR system, called IRON, which has been built by
our research group with the aim of having a robust tool for carrying out IR experiments. IRON is built
on top of the Lucene 1.2 RC4 library, which is an open-source library for IR written in Java and publicly
available at [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The system implements the vector space model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and a (tf · idf)-based weighting
scheme [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The stop-list that was used consists of 409 frequent Italian words and is publicly available
at [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>As regards the implementation of the statistical stemming algorithm, we built a suite of tools, called
Stemming Program for Language Independent Tasks (SPLIT), which implements the link-based algorithm
and chooses the best stem according to the probabilistic criterion described in Section 3. From the
vocabulary of the Italian CLEF sub-collection, SPLIT spawns a 2,277,297-node and 1,215,326-edge graph,
which is processed to compute prefix and suffix scores. SPLIT took 2.5 hours for 100 iterations on a
personal computer equipped with Linux, an 800 MHz Intel CPU and 256 MB RAM.</p>
    </sec>
    <sec id="sec-5">
      <title>Runs</title>
      <p>
        We tested four different stemming configurations:
1. NoStem: No stemming algorithm was applied.
2. Porter-like: We used the stemming algorithm for the Italian language, which is freely available
on the Snowball Web Site [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] edited by M. Porter. Besides its being publicly available for research
purposes, we chose this algorithm because it uses a kind of a-priori knowledge of the Italian
language, so comparing our SPLIT algorithm with this particular “linguistic” algorithm could give
some information about the possibility of estimating linguistic knowledge with statistically inferred
knowledge.
3. SPLIT: We implemented our first version of the stemming algorithm based on link analysis with
100 iterations.
4. SPLIT-L3: We included in our stemming algorithm a little injection of linguistic knowledge,
inserting a heuristic rule which forces the length of the stem to be at least 3.
      </p>
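      <p>The minimum-length rule of SPLIT-L3 can be illustrated by restricting the candidate prefixes before applying the probabilistic criterion. The sketch below is an assumption-laden illustration rather than the SPLIT-L3 code: the function name best_stem_min_len, the scores and the counts are all hypothetical, and leaving a word unstemmed when it is too short to have a stem of the required length is a choice made here.</p>
      <preformat>
```python
def best_stem_min_len(word, prefix_score, n_words_with_prefix, min_len=3):
    """Pick the best-scoring prefix of word among those with at least min_len characters."""
    candidates = [word[:i] for i in range(min_len, len(word))]
    if not candidates:
        # Assumption of this sketch: words that are too short are left unstemmed.
        return word
    return max(
        candidates,
        key=lambda x: prefix_score.get(x, 0.0) / n_words_with_prefix[x],
    )


# Hypothetical prefix scores and word counts, invented for illustration only.
prefix_score = {"p": 0.10, "po": 0.40, "por": 0.10, "port": 0.16}
n_words_with_prefix = {"p": 5, "po": 4, "por": 2, "port": 2}

print(best_stem_min_len("porta", prefix_score, n_words_with_prefix, min_len=1))
print(best_stem_min_len("porta", prefix_score, n_words_with_prefix, min_len=3))
```
      </preformat>
      <p>With min_len set to 1 the function reduces to the unconstrained criterion, so the same code covers both the SPLIT and the SPLIT-L3 style of selection in this sketch.</p>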
    </sec>
    <sec id="sec-6">
      <title>A Global Evaluation</title>
      <p>We carried out a macro evaluation by averaging the results over all the queries of the test collection.
Table 2 shows a summary of the figures related to the macro analysis of the stemming algorithms for the 2002
topics, while Table 3 reports the 2001 data.</p>
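      <table-wrap>
        <table>
          <thead>
            <tr><th>Run ID</th><th>Algorithm</th></tr>
          </thead>
          <tbody>
            <tr><td>PDDN</td><td>NoStem</td></tr>
            <tr><td>PDDP</td><td>Porter-like</td></tr>
            <tr><td>PDDS2PL</td><td>SPLIT</td></tr>
            <tr><td>PDDS2PL3</td><td>SPLIT-L3</td></tr>
          </tbody>
        </table>
      </table-wrap>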
      <p>Note that for both the 2001 and 2002 topics, all the considered stemming algorithms improve recall, since
the number of retrieved relevant documents is larger than the number
observed in the case of retrieval without any stemmer. As regards precision, for the 2002 topics stemming does not hurt the overall
performance of the system, while for the 2001 data stemming even increases precision, so the overall
performance is higher thanks to the application of stemming.</p>
      <p>Figure 2 shows the Averaged Recall-Precision curve at different levels of recall and Figure 3 illustrates
the Recall-Precision curve at given document cutoff values, both for the 2002 and 2001 topic sets.</p>
      <p>[Figure 2: Interpolated recall vs. average precision for the 2002 topics and the 2001 topics; y-axis: average precision (10% to 80%); curves: NoStem, Porter-like, SPLIT, SPLIT-L3.]</p>
      <p>[Figure 3: Precision (10% to 40%) vs. retrieved documents (5, 10, 15, 20, 30, 100, 200, 500 and 1000 docs) for (a) 2002 Topics and (b) 2001 Topics.]</p>
      <p>As regards
the use of link-based stemming algorithms, it is worth noting that SPLIT can attain levels of effectiveness
comparable to those of an algorithm based on linguistic knowledge. This is surprising given that SPLIT was
built without any sophisticated extension to HITS and that neither heuristics nor linguistic knowledge
was used to improve effectiveness. It should also be considered a good result given that it
was obtained for the Italian language, which is morphologically more complex than English.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Work</title>
      <p>The objective of this research was to investigate a stemming algorithm based on link analysis procedures.
The idea is that prefixes and suffixes, that is, stems and derivations, form communities once
extracted from words. We tested this hypothesis by comparing the retrieval effectiveness of SPLIT, a
link analysis-based algorithm derived from HITS, with that of a linguistic knowledge-based algorithm, on a
morphologically rather complex language such as Italian.</p>
      <p>The results are encouraging because the effectiveness level of SPLIT is comparable to that of the algorithm developed by
Porter. The results should be considered even better since SPLIT does not incorporate any heuristics
or linguistic knowledge. Moreover, stemming in general, and SPLIT in particular, was shown to improve effectiveness with
respect to not using any stemmer.</p>
      <p>We are carrying out further analysis at a micro level to understand the conditions under which SPLIT
performs better or worse than other algorithms. Further work is in progress to improve the
probabilistic decision criterion and to insert linguistic knowledge directly into the link-based model by
weighting links among prefixes and suffixes with a probabilistic function which could capture available
information on the language, such as, for example, the minimum length of a stem. Finally, further
experimental work is in progress with other languages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Borodin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.O.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tsaparas</surname>
          </string-name>
          .
          <article-title>Finding authorities and hubs from link structures on the World Wide Web</article-title>
          .
          <source>In Proceedings of the World Wide Web Conference</source>
          , pages
          <fpage>415</fpage>
          -
          <lpage>429</lpage>
          ,
          Hong Kong
          ,
          <year>2001</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cleverdon</surname>
          </string-name>
          .
          <article-title>The Cranfield Tests on Index Language Devices</article-title>
          . In K. Sparck Jones and P. Willett (Eds.).
          <source>Readings in Information Retrieval</source>
          , pages
          <fpage>47</fpage>
          -
          <lpage>59</lpage>
          , Morgan Kaufmann,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.B.</given-names>
            <surname>Frakes</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          .
          <article-title>Information Retrieval: data structures and algorithms</article-title>
          . Prentice Hall,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldsmith</surname>
          </string-name>
          .
          <article-title>Unsupervised learning of the morphology of a natural language</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <fpage>154</fpage>
          -
          <lpage>198</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hafer</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>Word segmentation by letter successor varieties</article-title>
          .
          <source>Information Storage and Retrieval</source>
          ,
          <volume>10</volume>
          :
          <fpage>371</fpage>
          -
          <lpage>385</lpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          . How effective is suffixing?
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ):
          <fpage>7</fpage>
          -
          <lpage>15</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          .
          <article-title>Authoritative sources in a hyperlinked environment</article-title>
          .
          <source>Journal of the ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>5</issue>
          ):
          <fpage>604</fpage>
          -
          <lpage>632</lpage>
          ,
          <year>September 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krovetz</surname>
          </string-name>
          .
          <article-title>Viewing Morphology as an Inference Process</article-title>
          .
          <source>In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR)</source>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lovins</surname>
          </string-name>
          .
          <article-title>Development of a stemming algorithm</article-title>
          .
          <source>Mechanical Translation and Computational Linguistics</source>
          ,
          <volume>11</volume>
          :
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>1968</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>The Jakarta Project</article-title>
          . Lucene. http://jakarta.apache.org/lucene/docs/index.html,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.D.</given-names>
            <surname>Manning</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <article-title>Foundations of statistical natural language processing</article-title>
          . The MIT Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.D.</given-names>
            <surname>Paice</surname>
          </string-name>
          . Another Stemmer.
          <source>In ACM SIGIR Forum</source>
          ,
          <volume>24</volume>
          ,
          <fpage>56</fpage>
          -
          <lpage>61</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Popovic</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Willett</surname>
          </string-name>
          .
          <article-title>The effectiveness of stemming for natural-language access to Slovene textual data</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>43</volume>
          (
          <issue>5</issue>
          ):
          <fpage>383</fpage>
          -
          <lpage>390</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>Snowball: A language for stemming algorithms</article-title>
          . http://snowball.sourceforge.net,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>McGill</surname>
          </string-name>
          .
          <article-title>Introduction to modern Information Retrieval</article-title>
          .
          McGraw-Hill
          , New York, NY,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          .
          <article-title>Term weighting approaches in automatic text retrieval</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          ):
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>Institut interfacultaire d'informatique. CLEF and Multilingual information retrieval</article-title>
          . University of Neuchatel. http://www.unine.ch/info/clef/,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          . Trec eval. ftp://ftp.cs.cornell.edu/pub/smart/,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>