<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ARKTiS - A Fast Tag Recommender System Based On Heuristics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Kleinbauer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Germesin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI) 66123 Saarbrücken</institution>
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the tag recommender system ARKTiS, our contribution to the 2009 ECML PKDD tag discovery challenge. ARKTiS consists of two separate modules for BibTeX entries and for bookmarked web pages. For generating tags, we distinguish between so-called internal and external methods, depending on whether a tag was extracted from the given information about a resource or whether additional resources were employed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The role of the end-user in the world wide web (WWW) has undergone a
substantial change in recent years from a passive consumer of relatively static web
pages to a central content producer. The addition of backchannels from WWW
clients to internet servers empowers non-expert users to actively participate in
the generation of web content. This has led to a new paradigm of usage,
colloquially coined "Web 2.0" [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>These novel kinds of interaction can be divided into two categories:
producing new information or making it accessible (e.g., web logs, forums, wikis,
etc.) and enriching already existing content (e.g., consumer reviews,
recommendations, tagging, etc.). One interpretation of the second type of interaction
is that it provides a means to cope with one of the problems generated by the
first type, namely the massive growth of available content and the
increasing difficulty for traditional information retrieval approaches to support
efficient access to the contained information. In this sense, the meta-content
produced in the second kind of interaction can be construed as mainly serving as a
navigation aid in an increasingly complex but weakly structured online world.</p>
      <p>In particular, the possibility for users to attach keywords to web resources
in order to describe or classify their contents bears an enormous potential for
structuring information, which facilitates subsequent access by both the original
user and other users. This task, commonly referred to as tagging, is simple enough
not to scare users away, yet the benefit of web resources annotated in such a
way is obvious enough to keep the motivation to supply tags high. The process
of attaching tags to web resources must therefore strike a fine balance between
simplicity and quality. Both of these properties could be greatly improved if
an automatic system could support a user by recommending tags for a given
resource.</p>
      <p>In this paper, we describe the ARKTiS system, developed to recommend tags
to a user for two specific types of web resources. The system was developed as a
contribution to an international challenge as part of the 2009 European
Conference on Machine Learning and Principles and Practice of Knowledge Discovery
in Databases (ECML PKDD 2009).</p>
    </sec>
    <sec id="sec-2">
      <title>Task and Data</title>
      <p>The goal of this challenge is the implementation of a system that can
automatically generate tag recommendations<sup>1</sup> for a given resource. Here, a resource
is either a bookmark of a web page or a BibTeX entry for different kinds of
documents. Tags typically are short English words, although they may also be
artificially created terms, generated e.g. by concatenating words ("indexforum")
or by using abbreviations ("langfr", for "language french"). The number of tags
a system may generate is restricted to a maximum of five.</p>
      <p>To prepare for the challenge, a training set of manually tagged resources
was provided. The data set consists of web page bookmarks and BibTeX entries
taken from the BibSonomy project<sup>2</sup>. Each entry has a unique id and was tagged
by at least two users. Thus a point in the data set can be viewed as a triple
&lt;resource-id, user-id, tags&gt;.</p>
      <p>For each resource, the corpus contains meta-data describing the resource
with a number of different fields. These fields are different for BibTeX entries
and for bookmark entries. For instance, the meta-data for BibTeX entries contain
fields describing the title of an article, the authors, the year of publication, or
the number of pages. For bookmarks, one field gives a short description of the
resource while another one contains the web page's URL. A full list of all available
fields can be found on the homepage of the challenge.</p>
      <p>In total, the training corpus contains 41,268 entries for bookmarks and 22,852
BibTeX entries, annotated with 13,276 unique tags by 1,185 different users.</p>
      <p>The data of the actual challenge (the eval set) was provided 48 hours before
the submission deadline. This set consists of unseen data that each system had
to tag, and all results that are presented and analyzed in section 5 were achieved
on this set. The evaluation data set contains 43,002 entries in total, with 26,104
BibTeX entries and 16,898 bookmark entries, circa two thirds of the amount of the
training data.</p>
      <p><sup>1</sup> For the remainder of this paper, we will refer to this process as "(automatic) tagging".</p>
      <p><sup>2</sup> http://www.bibsonomy.org</p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>One key observation for our participation in the ECML PKDD challenge was
that time played a central role in two different senses. First, the task required
each system to suggest tags for quite a large number of resources in a rather
short period of time: 43,002 entries in only 48 hours, including retrieval and
parsing of the test data as well as formatting and uploading of the final result
data. Second, since the challenge had a fixed deadline, the time to develop a
running system faced a natural limit with the start of the 48-hour evaluation
period.</p>
      <p>Both points had a direct influence on the conceptualization and realization
of our system ARKTiS, in that a number of more sophisticated ideas had to be
sacrificed. As a result, ARKTiS can be seen as an exercise in software
engineering rather than thorough science. The final system implements straightforward
strategies with a focus on robustness and processing speed.</p>
      <sec id="sec-3-1">
        <title>Motivation</title>
        <p>As outlined above, the training corpus contains 13,276 different tags for over
41,268 data points, a proportion that indicates a potential data sparseness
problem for classical machine learning approaches. We therefore opted for an
algorithmic approach instead, based on heuristic considerations.</p>
        <p>Since the desired system output is (English) words, we can distinguish two
potential sources for output: from within the resource itself (internal words) or
from outside material (external words). This distinction is blurred insofar as
the system input actually consists not of the resources themselves, but rather
of meta-data. In that sense, even tags taken from the resources themselves could be
argued to be external. We take the view that words stemming from the meta-data
or from the resources referred to by the meta-data are considered internal.</p>
        <p>Since we could not hope to implement a competitive system, we were mainly
interested in how useful such a distinction would be in terms of recommending
tags. Although we concentrated on internal methods as described below, we
explored using document similarity measures in the BibTeX module to re-use
tags that were manually assigned to documents similar to the current system
input. Also, some of our implemented techniques, such as translating German
words from the original resource to English, can be considered borderline between
internal and external.</p>
        <p>
          For the internal approaches, we looked at the task of tagging a resource as an
analogy to automatic text summarization, somewhat taken to an extreme where
a "summary" consists only of five words. In extractive summarization, summaries
of documents are generated by identifying those sentences in a document which,
when concatenated, serve as a surrogate for the original document. In that spirit,
tagging becomes the task of identifying those words from a resource that together
describe the "aboutness" of the resource.
        </p>
        <p>
          A number of researchers in recent years have engaged in the task of developing
an automatic tagging system. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] use natural language resources to suggest tags
for BibTeX entries and web pages. They include conceptual information from external
resources, such as WordNet [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], to create the notion of a "concept space". On
top of this notion, they exploit the textual content that can be associated with
bookmarks, documents and users and generate models within the concept space
or the tag space derived from the training data.
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] model the problem of automated tag suggestion as a multi-label text
classification problem with tags as categories.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the TagAssist system focuses on the task of the generation of tags
for web-log entries. They access tags of similar documents in a similar spirit to
our own method described in section 4.1.
        </p>
        <p>In addition to these concrete systems, we find automatic tagging to bear some
similarities with research in automatic extractive summarization. In both tasks,
the identification of salient portions of a resource's text is a central consideration.
For tagging, which reduces extraction to single words, we call such a method an
internal approach (see section 3.1).</p>
        <p>
          For instance, in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the author conducted a wide range of tests to find
predictive features for relevant sentences. Despite relying on manual experiments, the
general results from this early research were later confirmed by machine
learning approaches, e.g., [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For the bookmark module, both the title and the first
sentence heuristic (see 4.2) were inspired by these findings.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Taggers</title>
      <p>As hinted by the data set, the task can be viewed as two different sub-tasks,
the tagging of BibTeX entries and the tagging of bookmarked web pages.
Consequently, the ARKTiS system consists of two independent modules which are
both instances of a common framework architecture, depicted in Figure 1. The
modules can be run in two distinct processes.</p>
      <p>
        Efficient processing of the input data is an important requirement for this
challenge. In a sequential architecture, processing 43,002 data points in 48 hours
would leave a tagging system about 4 seconds per data point on average. Given
that the data points contain only meta-data and that the actual documents, if
needed, have to be retrieved through the internet and parsed, a time span of
4 seconds poses quite a strong limitation on the complexity of the performed
computations. Running multiple instances of taggers concurrently relaxes this
limitation. In our current setup, both modules for BibTeX and bookmark
tagging internally run ten tagging threads in parallel, which increases the maximum
average processing time to 80 seconds per data point.
      </p>
      <p>
        The tagging system responsible for the BibTeX entries uses a combination of
internal and external techniques. A thorough investigation of the provided training
material showed that most of the entries (95.4%) do not contain a valid link to
the actual PDF document. This is unfortunate, as it limits internal approaches
which draw tags from the contents of the document in question.
      </p>
      <p>
        Internal approach. To compensate for the cases in which the PDF document is
unavailable, we use the remaining information from the meta-data, namely the
title of the document, its description and its abstract. The employed approach
analyzes these fields and extracts tags directly out of their textual information.
First, we lowercase all words in the text of each field and remove all
punctuation and symbols. After that, we apply the POS tagging system described in
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to extract content words (nouns, adjectives and verbs) out of the text. The
sequential processing of the text is shown below:
      </p>
      <p>Pre-processing
Original title: The PageRank Citation Ranking: Bringing Order to the Web
Lowercased: the pagerank citation ranking: bringing order to the web
No symbols: the pagerank citation ranking bringing order to the web</p>
      <p>Extraction of Tags
POS-tagged: the/DT pagerank/NN citation/NN ranking/NN bringing/VBG order/NN to/TO the/DT web/NN
Content words: pagerank citation ranking order web</p>
      <p>After removing stop-words, the remaining words are directly used as tags,
giving preference to tags stemming from the title field over those from the description
field, and to those over tags from the abstract field. Only the first five tags are returned after
filtering out duplicates.</p>
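      <p>The internal approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ARKTiS implementation: the POS-based content-word filter is replaced by a small hypothetical stop-word list (<monospace>STOP_WORDS</monospace>), and the function names and the dictionary layout of an entry are invented for the example.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical stand-in for the POS tagger and the stop-word list used in
# the paper. "bringing" is listed only so that the toy example reproduces
# the content words shown in the text.
STOP_WORDS = {"the", "to", "a", "an", "of", "and", "in", "for", "on", "bringing"}


def preprocess(text):
    """Lowercase the text, replace punctuation and symbols by spaces, split."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def internal_tags(entry, max_tags=5):
    """Collect up to max_tags words, preferring title, then description, then abstract."""
    tags = []
    for field in ("title", "description", "abstract"):
        for word in preprocess(entry.get(field, "")):
            if word in STOP_WORDS or word in tags:
                continue  # skip stop-words and duplicates
            tags.append(word)
            if len(tags) == max_tags:
                return tags
    return tags


entry = {"title": "The PageRank Citation Ranking: Bringing Order to the Web"}
print(internal_tags(entry))
```
]]></preformat>
      <p>Because the fields are visited in a fixed order and the loop stops at five tags, words from the description or abstract only appear when the title yields fewer than five candidates.</p>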
      <p>External approach. For the remaining entries, where the source documents
were available, we use a corpus-based approach inspired by standard information
retrieval techniques. The idea here is that if a new document is similar to a
document from the training corpus, we may re-use the tags that have been
added manually to the training document.</p>
      <p>Hence, this part of the tagger first ranks all documents from the training data
by similarity to the current document. A second step then takes tags from the
documents in rank order and returns the first five of them, discarding duplicates.</p>
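      <p>The second step, collecting tags in rank order, might look as follows. The sketch assumes that the ranking step has already produced a list of training documents sorted from most to least similar; the dictionary keys <monospace>id</monospace> and <monospace>tags</monospace> and the sample data are hypothetical.</p>
      <preformat><![CDATA[
```python
def tags_from_ranking(ranked_docs, max_tags=5):
    """Walk documents in rank order and collect their tags, skipping duplicates."""
    result = []
    for doc in ranked_docs:
        for tag in doc["tags"]:
            if tag not in result:
                result.append(tag)
            if len(result) == max_tags:
                return result
    return result


# Hypothetical ranking result, most similar document first.
ranked = [
    {"id": "doc-a", "tags": ["web", "search", "ranking"]},
    {"id": "doc-b", "tags": ["search", "graph"]},
    {"id": "doc-c", "tags": ["ir", "links", "www"]},
]
print(tags_from_ranking(ranked))
```
]]></preformat>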
      <p>To rank the documents, all documents in the corpus are first transformed from PDF format
to plain text by the PDFBox toolkit<sup>3</sup>. After that, the whole text is segmented
into sentences using punctuation information (.!?;:\n) and then pre-processed
in the same way as described in the internal approach. After removing
non-content words, we calculate tf.idf values for each word in the document, resulting
in the following mapping:</p>
      <p>&lt;list of tags&gt; → &lt;vector of TF/IDF-values&gt;</p>
      <p>A tf.idf value measures the relative importance of the word
w<sub>i</sub> for the current document j, in relation to a set of documents D (see equation
1, where n<sub>i,j</sub> is the number of occurrences of the word w<sub>i</sub> in document j).</p>
      <disp-formula>
        <label>(1)</label>
        <tex-math>\mathrm{tf{\cdot}idf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \cdot \log \frac{|D|}{|\{d \in D : w_i \in d\}|}</tex-math>
      </disp-formula>
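      <p>As an illustration of equation 1, the following sketch computes tf.idf weights for one tokenized document. The function name <monospace>tf_idf</monospace>, the toy corpus, and the choice to give weight 0.0 to words that occur in no corpus document are assumptions of this example.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def tf_idf(doc_words, corpus):
    """Return {word: tf.idf weight} for one tokenized document.

    corpus is a list of tokenized documents (the set D). Words that occur
    in no corpus document get weight 0.0, an assumption of this sketch.
    """
    counts = Counter(doc_words)
    total = sum(counts.values())  # sum over k of n_{k,j}
    weights = {}
    for word, n in counts.items():
        df = sum(1 for d in corpus if word in d)  # documents containing the word
        weights[word] = (n / total) * math.log(len(corpus) / df) if df else 0.0
    return weights


# Toy corpus of three tokenized documents.
corpus = [
    ["pagerank", "citation", "ranking", "web"],
    ["web", "search", "engine"],
    ["genetic", "algorithms", "programming"],
]
weights = tf_idf(corpus[0], corpus)
print(weights)
```
]]></preformat>
      <p>In this toy corpus, "web" occurs in two of three documents and therefore receives a lower weight than "pagerank", which occurs in only one.</p>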
      <p>This procedure is, of course, carried out only once and the resulting
mapping is stored offline. In the actual tagging process, we generate the vector of
tf.idf values in the same way for the document to tag and compare the resulting
vector with all document vectors in the corpus. We have experimented with
two different similarity measures.</p>
      <p>The first variant compares two documents by the normalized distance
between their tf.idf vectors, as shown in equation 2.</p>
      <disp-formula>
        <label>(2)</label>
        <tex-math>\mathrm{sim}(t_0, t_1) = \frac{\sum_{i=0}^{N-1} |t_0[i] - t_1[i]|}{\sum_{i=0}^{N-1} \left(t_0[i] + t_1[i]\right)}</tex-math>
      </disp-formula>
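      <p>In addition, we implemented cosine similarity, which measures similarity by the
cosine of the angle between the two vectors that the two documents describe
(equation 3).</p>
      <disp-formula>
        <label>(3)</label>
        <tex-math>\mathrm{sim}(t_0, t_1) = \cos\theta = \frac{\sum_{i=0}^{N-1} t_0[i]\, t_1[i]}{\sqrt{\sum_{i=0}^{N-1} t_0[i]^2}\, \sqrt{\sum_{i=0}^{N-1} t_1[i]^2}}</tex-math>
      </disp-formula>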
      <p>In our experiments, the normalized distance measure yielded better
performance than cosine similarity and consequently we used only the former in the
final system.</p>
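      <p>Both measures can be written compactly for two equal-length tf.idf vectors. The sketch below reads the normalized distance as the sum of absolute component differences divided by the sum of all components, and the cosine measure in its standard form with squared components under the square roots; both readings, the function names, and the handling of all-zero vectors are assumptions of this example.</p>
      <preformat><![CDATA[
```python
import math


def normalized_distance(t0, t1):
    """Sum of absolute differences divided by the sum of all components.

    For non-negative vectors the result lies between 0.0 (identical)
    and 1.0 (no shared non-zero dimension), so smaller means more similar.
    """
    denom = sum(a + b for a, b in zip(t0, t1))
    if denom == 0:
        return 0.0  # two all-zero vectors; treated as identical here
    return sum(abs(a - b) for a, b in zip(t0, t1)) / denom


def cosine_similarity(t0, t1):
    """Cosine of the angle between the two vectors; larger means more similar."""
    norm = math.sqrt(sum(a * a for a in t0)) * math.sqrt(sum(b * b for b in t1))
    if norm == 0:
        return 0.0  # undefined for a zero vector; 0.0 is a convention of this sketch
    return sum(a * b for a, b in zip(t0, t1)) / norm


print(normalized_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```
]]></preformat>
      <p>Note the opposite orientation of the two measures: a ranking by normalized distance sorts in ascending order, a ranking by cosine similarity in descending order.</p>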
      <sec id="sec-4-1">
        <p><sup>3</sup> http://incubator.apache.org/pdfbox/</p>
        <sec id="sec-4-1-1">
          <title>The Bookmark Tagger</title>
          <p>As in the case of the BibTeX tagger, the bookmark tagger relies on relatively
simple heuristics to determine the keywords to recommend. The input data
provides two kinds of information: the URL of the web page to tag and a short
description, which in some cases is identical to the web page's title string.</p>
          <p>Processing the URL field. In our system, the URL is used to fetch the contents
of the actual web page, but since the domain name and path may already contain
candidate terms, the URL string is also processed itself, in three sequential steps:
tokenizing, filtering, and dict/split.</p>
          <p>For the tokenization, the URL is split up at every non-letter, non-digit
character, such as a forward slash. By matching against a manually crafted blacklist
of terms generally considered uninformative, typical artifacts such as "www" or
"html" that result from the tokenization process are filtered out. The following
examples illustrate these two steps:</p>
          <p>Original URL: http://www.example.com/new-example/de/bibtex.htm
Tokenizing: http www example com new example de bibtex htm
Filtering: example new example de bibtex
Original URL: http://www.coloradoboomerangs.com
Tokenizing: http www coloradoboomerangs com
Filtering: coloradoboomerangs</p>
          <p>A dictionary of American English together with a list of the names of all articles
in the English Wikipedia<sup>4</sup> of 2007 are used to check whether the resulting tokens are
actual words. The rationale for incorporating Wikipedia is that it gives additional
terms from article titles which often are not found in a dictionary, such as
technical terms ("bibtex"). If a token cannot be found in either list, we try to
split the token up into two sub-tokens which, in case they are both contained in
the dictionary, are then used instead of the original token. This idea is based on
the observation that domain names in particular are sometimes a concatenation
of two terms.</p>
          <p>Applied to the above example, this step generates the following keyword lists:
Keywords: example new example de bibtex
Dict/Split: example new example bibtex</p>
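          <p>The three URL steps can be sketched as follows. The blacklist and dictionary here are tiny hypothetical stand-ins for the hand-crafted blacklist and the dictionary and Wikipedia title lists; lowercasing the URL is an assumption, as is dropping tokens that are unknown and cannot be split, which is inferred from the treatment of "de" in the example above.</p>
          <preformat><![CDATA[
```python
import re

# Hypothetical miniature resources for illustration only.
BLACKLIST = {"http", "https", "www", "com", "htm", "html"}
DICTIONARY = {"example", "new", "bibtex", "colorado", "boomerangs"}


def tokenize(url):
    """Split at every character that is neither a letter nor a digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", url.lower()) if t]


def filter_tokens(tokens):
    """Drop tokens found on the blacklist of uninformative terms."""
    return [t for t in tokens if t not in BLACKLIST]


def dict_split(tokens):
    """Keep known words; try to split unknown tokens into two known words."""
    result = []
    for token in tokens:
        if token in DICTIONARY:
            result.append(token)
            continue
        for i in range(1, len(token)):
            head, tail = token[:i], token[i:]
            if head in DICTIONARY and tail in DICTIONARY:
                result.extend([head, tail])
                break
        # Unknown tokens that cannot be split are dropped (assumption).
    return result


def url_keywords(url):
    """Run the three steps in sequence: tokenizing, filtering, dict/split."""
    return dict_split(filter_tokens(tokenize(url)))


print(url_keywords("http://www.example.com/new-example/de/bibtex.htm"))
print(url_keywords("http://www.coloradoboomerangs.com"))
```
]]></preformat>
          <p>With this toy dictionary, the second URL yields the two sub-tokens "colorado" and "boomerangs", illustrating the observation that domain names are sometimes concatenations of two terms.</p>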
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p><sup>4</sup> http://en.wikipedia.org</p>
        <p>Processing the description field. The description that is part of the input
data is tokenized in the same way. However, no further attempts are made to
filter out tokenization artifacts or to split the resulting tokens into sub-parts in
case they are not contained in the dictionary. In other words, of the above three
steps, only tokenizing is performed on the description of the bookmark.</p>
        <p>Processing the bookmarked web page. With the provided URL, the content
of the given web page is retrieved at run-time. We do not attempt to detect
whether the server returns an actual content page or a specialized message, such
as an HTTP 404 Not Found error page. After the HTML content of a web page has
been downloaded, three different extraction methods are applied: HTML-meta,
title, and first sentence.</p>
        <p>The first method operates on the head section of the document, where it
locates and parses the &lt;meta&gt; elements "keywords" and "description". The
contents of these elements are provided by the author of the HTML document and
may contain valuable hints on what the document actually is about. The
contents are extracted and then undergo the same tokenizing procedure as described
above.</p>
        <p>HTML:
&lt;html&gt;
&lt;head&gt;
&lt;meta name=keywords content="example, sample"&gt;
&lt;meta name="description" content="A made-up example webpage"&gt;
...
&lt;/head&gt;
...
&lt;/html&gt;
Extracting: example, sample
a made-up example webpage
Tokenizing: example sample a made up example webpage</p>
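        <p>A minimal sketch of the HTML-meta method, using the <monospace>html.parser</monospace> module from the Python standard library. The class name <monospace>MetaExtractor</monospace> and the helper <monospace>tokenize</monospace> are hypothetical, and no claim is made that ARKTiS parses pages this way.</p>
        <preformat><![CDATA[
```python
import re
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the content of the keywords and description meta elements."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.contents.append(attributes.get("content") or "")


def tokenize(text):
    """Lowercase and split at every non-letter, non-digit character."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


page = """<html><head>
<meta name=keywords content="example, sample">
<meta name="description" content="A made-up example webpage">
</head><body></body></html>"""

parser = MetaExtractor()
parser.feed(page)
parser.close()
tokens = [t for content in parser.contents for t in tokenize(content)]
print(parser.contents)
print(tokens)
```
]]></preformat>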
        <p>
          Also in the head section is the declaration of the title of the document. This
is not only intuitively a good source for relevant keywords, but research in the
field of automatic text summarization has also shown in the past that headings
contain informative content [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>HTML: &lt;title&gt;Hello, world - again, an example&lt;/title&gt;
Extracting: hello, world - again, an example
Tokenizing: hello world again an example</p>
        <p>Another finding from summarization research is that locational cues work
well for determining relevant content words. To apply this insight to the task
at hand, the third document-based method tokenizes the first sentence of each
HTML document in the same manner described above.</p>
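        <p>The title and first-sentence methods could be sketched together as shown below. How the first sentence of an HTML document is delimited is not specified in the text; this example assumes it is the visible body text up to the first sentence-final punctuation mark, and it does not treat script or style content specially. All names are hypothetical.</p>
        <preformat><![CDATA[
```python
import re
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect the title string and the text found inside the body element."""

    def __init__(self):
        super().__init__()
        self.current = None  # "title", "body" or None
        self.title = ""
        self.body = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag in ("title", "body"):
            self.current = None

    def handle_data(self, data):
        if self.current == "title":
            self.title += data
        elif self.current == "body":
            self.body += data


def first_sentence(text):
    """Return the text up to and including the first '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def tokenize(text):
    """Lowercase and split at every non-letter, non-digit character."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


page = """<html><head><title>Hello, world - again, an example</title></head>
<body><p>Welcome to the example page. It has more text.</p></body></html>"""

parser = PageText()
parser.feed(page)
parser.close()
print(tokenize(parser.title))
print(tokenize(first_sentence(parser.body)))
```
]]></preformat>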
        <p>The result of these steps is a set of basic terms. For the final recommendation,
two more processing steps are performed: a ranking step and a normalization
step.</p>
        <p>Ranking keywords. For the ranking, each of the previously extracted keywords
is described according to four predefined dimensions: Source, InDict, POS,
and Navigational.</p>
        <p>The values for these dimensions are floating point numbers that represent
how valuable a keyword is with respect to being among the recommended tags.
For instance, analogous to the first step described above, the Source dimension
may receive one of the following values:</p>
        <list list-type="bullet">
          <list-item><p>URL (= 0.4)</p></list-item>
          <list-item><p>Description (= 1)</p></list-item>
          <list-item><p>HTML-Meta (= 1)</p></list-item>
          <list-item><p>Title (= 0.8)</p></list-item>
          <list-item><p>First sentence (= 0.9)</p></list-item>
        </list>
        <p>
          The other dimensions describe whether a keyword is found in the English
dictionary (and/or list of Wikipedia articles), its part of speech (NN = 1, NNS
= 0.9, VBG = 0.8, VERB = 0.5, OTHER = 4) and whether it is found
on a blacklist of navigational terms, such as "impress", "home", etc., which was
created manually by the authors. As with other heuristics, the idea of using such
stigma word lists can also be traced back to early summarization research, see
e.g. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>To rank the keywords, a weighted sum of the four values is computed for
each keyword. Since a training corpus was available, good practical weights could
have been determined with a machine learning approach. Unfortunately, since
time was scarce, we had to estimate sensible weights by hand; inspecting the
performance of the tagger on selected samples from the training corpus helped
in this part of the development.</p>
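        <p>The ranking step can be illustrated with a short sketch. The Source values and the POS values for NN, NNS, VBG and VERB are taken from the text; everything else is an assumption made for the example: the hand-tuned weights are not reported, so all weights are set to 1.0, the InDict and Navigational dimensions are encoded as 1.0 or 0.0, and the value for OTHER is not legible in the source and is set to 0.0 here.</p>
        <preformat><![CDATA[
```python
# Values of the Source dimension as listed in the text.
SOURCE_VALUES = {
    "url": 0.4, "description": 1.0, "html-meta": 1.0,
    "title": 0.8, "first-sentence": 0.9,
}
# POS values from the text; the value for OTHER is an assumption.
POS_VALUES = {"NN": 1.0, "NNS": 0.9, "VBG": 0.8, "VERB": 0.5, "OTHER": 0.0}

# Hypothetical weights; the hand-estimated weights are not given in the paper.
WEIGHTS = {"source": 1.0, "in_dict": 1.0, "pos": 1.0, "navigational": 1.0}


def score(keyword):
    """Weighted sum over the four dimensions of one keyword."""
    values = {
        "source": SOURCE_VALUES[keyword["source"]],
        "in_dict": 1.0 if keyword["in_dict"] else 0.0,            # assumed encoding
        "pos": POS_VALUES.get(keyword["pos"], POS_VALUES["OTHER"]),
        "navigational": 0.0 if keyword["navigational"] else 1.0,  # assumed encoding
    }
    return sum(WEIGHTS[dim] * values[dim] for dim in WEIGHTS)


def rank(keywords, max_tags=5):
    """Return the terms of the highest-scoring keywords, best first."""
    ordered = sorted(keywords, key=score, reverse=True)
    return [k["term"] for k in ordered[:max_tags]]


candidates = [
    {"term": "home", "source": "html-meta", "in_dict": True, "pos": "NN", "navigational": True},
    {"term": "example", "source": "url", "in_dict": True, "pos": "NN", "navigational": False},
    {"term": "bibtex", "source": "title", "in_dict": True, "pos": "NN", "navigational": False},
]
print(rank(candidates))
```
]]></preformat>
        <p>With these placeholder weights, the navigational term "home" ends up last even though it comes from a highly valued source, which is the intended effect of the blacklist dimension.</p>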
        <p>Normalization. In a final normalization step, a German-English dictionary is
used to translate German keywords to English ones and to re-weight those
keywords that contain other keywords as sub-strings. In such a case it was speculated
that the keyword contained as a sub-string would likely be the more general term
and thus its final score was slightly increased.</p>
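        <p>A possible shape of the normalization step is sketched below. The two-entry dictionary, the bonus of 0.1 (the text only says "slightly increased"), and the rule for colliding translations are assumptions of this example.</p>
        <preformat><![CDATA[
```python
# Hypothetical two-entry stand-in for a German-English dictionary.
GERMAN_ENGLISH = {"bibliothek": "library", "suche": "search"}
SUBSTRING_BONUS = 0.1  # assumed amount of the "slight" increase


def normalize(scored):
    """Translate keywords and boost those contained in other keywords.

    scored maps each keyword to its ranking score; a new mapping is returned.
    """
    translated = {}
    for word, value in scored.items():
        english = GERMAN_ENGLISH.get(word, word)
        # If a translation collides with an existing keyword, keep the higher score.
        translated[english] = max(value, translated.get(english, value))
    result = dict(translated)
    for word in translated:
        # A keyword contained in another keyword is taken to be the more
        # general term, so its score is raised slightly.
        if any(word != other and word in other for other in translated):
            result[word] += SUBSTRING_BONUS
    return result


normalized = normalize({"suche": 2.0, "searchengine": 2.5, "bibliothek": 1.0})
print(normalized)
```
]]></preformat>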
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>In the evaluation, the results of our tagging system ARKTiS were compared
against the tags that were annotated by humans. The test data were provided
48 hours before the submission deadline. Considering at most five tags per
entry, the evaluation uses precision, recall and f-score values as measurements. In
the following, we present the results that were achieved on this data and
compare them against a baseline system. The baseline system predicts the five
most common tags from the training data (Figure 3) for each input entry.</p>
      <p>BibTEX: ccp jrr programming genetic algorithms
Bookmarks: software indexforum video zzztosort bookmarks</p>
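      <p>The baseline and the evaluation measures can be sketched as follows. The function names and the toy data are hypothetical, and the per-entry computation shown here is only an assumption about how precision, recall and f-score are obtained; the official challenge evaluation may aggregate differently.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def most_common_tags(training_posts, n=5):
    """Baseline: the n most frequent tags over all training posts."""
    counts = Counter(tag for post in training_posts for tag in post)
    return [tag for tag, _ in counts.most_common(n)]


def precision_recall_f(recommended, gold):
    """Precision, recall and f-score of one recommendation against the gold tags."""
    hits = len(set(recommended) & set(gold))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy training data: each post is the list of tags one user assigned.
training_posts = [["a", "b"], ["a"], ["b", "a"], ["c"]]
baseline = most_common_tags(training_posts, n=2)
print(baseline)
print(precision_recall_f(baseline, ["a", "x"]))
```
]]></preformat>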
      <p>The results of the baseline system are presented in Table 1, where we can see
a maximum f-score of 0.55%. Comparing this to the results of ARKTiS (Table 2),
we can see that our system clearly outperforms the baseline with an f-score of
almost 11%.</p>
      <p>Our work shows that it is possible to design and implement a basic tag
recommender system even with a very limited development time. The two tracks,
BibTeX and bookmark tagging, were designed and realized independently but
on top of a common, concurrent framework.</p>
      <p>The overall task can be considered challenging, especially if results are
evaluated on the basis of recall and precision: our final system scored a rather low
11% f-score. The large number of different gold-standard tags makes this number
difficult to interpret; however, it is clear that it leaves room for improvement.
The winning entry of the 2009 challenge reached an f-score of 19 percent.</p>
      <p>In a more detailed analysis, we found that the bookmark module
outperformed the BibTeX module to some degree. As described above, the two modules
employ rather different approaches, thus a next logical step will be to combine
the best ideas from both modules.</p>
      <p>The biggest drawback for ARKTiS as described in this paper was the fact
that we entered the ECML PKDD challenge at a late point. As a consequence,
a number of interesting and more sophisticated ideas had to be left out of the
system purely due to the lack of implementation time. For instance, the use of
tf.idf scores in the BibTeX module is very limited, as is the use of content terms
beyond the first sentence in the bookmark module.</p>
      <p>At the same time, the ARKTiS system has proven its robustness and will be
a good starting point for further research in the area of automatic tagging.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>O'Reilly</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>What is web 2.0: Design patterns and business models for the next generation of software</article-title>
          .
          <source>Social Science Research Network Working Paper Series</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tatu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikanth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Silva</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Rsdc'08: Tag recommendations using bookmark content</article-title>
          .
          <source>In: ECML PKDD Discovery Challenge</source>
          <year>2008</year>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fellbaum</surname>
          </string-name>
          , C., ed.:
          <source>WordNet: An Electronic Lexical Database</source>
          . MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Vlahavas</surname>
            ,
            <given-names>I.K.G.T.I.</given-names>
          </string-name>
          :
          <article-title>Multilabel text classification for automated tag suggestion</article-title>
          .
          <source>In: ECML PKDD Discovery Challenge</source>
          <year>2008</year>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mishne</surname>
          </string-name>
          , G.:
          <article-title>Autotag: A collaborative approach to automated tag assignment to weblog posts</article-title>
          .
          <source>In: Proceedings of WWW'06</source>
          ,
          <string-name>
            <surname>Edinburgh</surname>
          </string-name>
          , Scotland (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Edmundson</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          :
          <article-title>New methods in automatic extracting</article-title>
          .
          <source>Journal of the ACM</source>
          <volume>16</volume>
          (
          <issue>2</issue>
          ) (
          <year>1969</year>
          )
          <fpage>264</fpage>
          –
          <lpage>285</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kupiec</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A trainable document summarizer</article-title>
          .
          <source>In: SIGIR '95: Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , New York, NY, USA, ACM (
          <year>1995</year>
          )
          <fpage>68</fpage>
          –
          <lpage>73</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Phan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>CRFTagger: CRF English POS Tagger</article-title>
          .
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maybury</surname>
          </string-name>
          , M.T., eds.:
          <source>Advances in Automatic Text Summarization</source>
          . The MIT Press, Cambridge, MA (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>