<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Pipeline Tweet Contextualization System at INEX 2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khaled Hossain Ansary</string-name>
          <email>ansary@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anh Tuan Tran</string-name>
          <email>ntran@L3S.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nam Khanh Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz Universität Hannover / Forschungszentrum L3S</institution>
          ,
          <addr-line>Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article describes a pipeline system and preliminary results for Tweet Contextualization at INEX 2013. The system consists of three steps: tweet analysis, passage retrieval and summarization. For each tweet, key phrases are first extracted using the ArkTweet toolkit and several heuristics. They are then submitted as queries to the Indri search engine to retrieve relevant passages. Finally, a multi-document summarization system (MEAD) is used to generate the output document with a limit of 500 words. The preliminary results show that the approach does not work well: our run was ranked 22nd out of 24 runs. We discuss our observations on these results and some possible further improvements.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Several studies have addressed this task. While [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] present improvements to question answering techniques using information retrieval (IR), [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] describes a hybrid tweet contextualization system using IR and automatic
summarization. They used the Nutch architecture together with TF-IDF-based sentence ranking
and sentence extraction techniques for automatic summarization. An approach
based on mapping source documents into a reduced semantic space is
proposed by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They estimated the words from the semantic space via a latent
Dirichlet allocation (LDA) algorithm. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] developed and tested a statistical word
stemmer, which is used by CORTEX to preprocess input texts and generate
readable summaries. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] describes a sentence retrieval technique that applies
three methodologies: i) language modeling score, ii) relevance modeling score
and iii) topical relevance modeling score.
      </p>
      <sec id="sec-1-1">
        <title>1 https://twitter.com/</title>
        <p>[Figure 1: system pipeline. Component labels: Phrase Chunker, Indexer, Full-text index, Passage Retriever, Score Resolver, Position Resolver, Document Reconstructor, Multi-doc Summarizer, Output Converter.]</p>
        <p>
          Text summarization has been well studied in the artificial
intelligence community, especially in text mining and information retrieval.
Among these efforts, MEAD [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is a publicly available toolkit for multi-document
summarization, which generates summaries using cluster centroids produced by a topic
detection and tracking system.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>A Pipeline Tweet Contextualization System</title>
      <p>The system pipeline is shown in Figure 1. It consists of three
components: Phrase Chunker, Passage Retriever and Summarizer.</p>
      <p>Phrase Chunker</p>
      <p>As shown in the system workflow, the first step is to retrieve passages from
Wikipedia articles given a tweet of interest. As in the traditional
retrieval approach, we initially used the words present in the tweet to retrieve the
relevant passages from Wikipedia. However, we observed unacceptably low
performance when using the original words to query the indices. This is attributable to the
highly noisy nature of tweet contents, where key phrases are often mixed with
non-content tokens such as emoticons and overused punctuation. In addition,
users employ several ad-hoc formats that are hardly found elsewhere when
posting tweets. They can use hashtags (a single word starting with '#') to provide
implicit context for the tweet, or use the at (@) symbol to tag other Twitter
accounts in the content. In many cases, words are intentionally modified, for example by
repeating vowels to express emotions (e.g. 'so coooooooool this show was !! :=)').
Such writing styles lead to many irrelevant results and propagate the noise
to the next step.</p>
      <p>
        To support passage retrieval, we tuned our phrase chunker to
detect and extract the key phrases that are more informative than the rest of
the tweet content. We used the ArkTweet toolkit [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to tokenize the tweet content
and to annotate each token with an adjusted part-of-speech tag. Apart from
the Penn TreeBank tagset, ArkTweet introduces a number of tags specific to the
Twitter domain, such as hashtag (#), at-mention (@), discourse marker (~), which
indicates the continuation of a message across multiple tweets such as retweets,
URL (U), and emoticon (E). Detailed references can be found at
http://www.ark.cs.cmu.edu/TweetNLP/annot_guidelines.pdf.
      </p>
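      <p>As an illustration only, the following minimal sketch separates Twitter-specific tokens from regular tokens once a tweet has been tagged. It assumes the tagger output is already available as (token, tag) pairs; the function name split_tokens and the example tweet are hypothetical and do not come from the ArkTweet API.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Twitter-specific tags named in the text; every other tag is treated as a
# regular part-of-speech tag (an assumption of this sketch).
TWITTER_TAGS = {"#", "@", "~", "U", "E"}


def split_tokens(tagged):
    """Split (token, tag) pairs into regular tokens and Twitter-specific tokens."""
    content = [(tok, tag) for tok, tag in tagged if tag not in TWITTER_TAGS]
    special = [(tok, tag) for tok, tag in tagged if tag in TWITTER_TAGS]
    return content, special


# Hand-written example input; in practice the pairs would come from the tagger.
tagged_tweet = [
    ("RT", "~"), ("@user", "@"), (":", "~"), ("new", "A"), ("album", "N"),
    ("out", "T"), ("now", "R"), ("http://t.co/x", "U"), ("#music", "#"), (":)", "E"),
]

content, special = split_tokens(tagged_tweet)
print([tok for tok, _ in content])
```
]]></preformat>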
      <p>After tokenizing the tweet, we employed several heuristics to detect the key
phrases as overlapping sequences of consecutive tokens. For example, we required that a key
phrase not be a mix of hashtags and other words, and we skipped phrases
that contain no Penn TreeBank tags. The chunker iteratively generates all
n-grams, where n varies from 1 to 5. For each n-gram, it checks each of
the heuristics. We applied a dynamic programming approach so that
heuristics are not checked again on subsumed n-grams.</p>
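      <p>The n-gram generation and the two example heuristics can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual chunker: the input is a list of (token, tag) pairs, the two predicates are a literal reading of the examples given in the text, the names is_candidate, ngrams and key_phrases are hypothetical, and the dynamic programming optimization is omitted.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Twitter-specific tags named in the text; all other tags count as regular
# part-of-speech tags (an assumption of this sketch).
TWITTER_TAGS = {"#", "@", "~", "U", "E"}


def is_candidate(gram):
    """Apply the two example heuristics to one n-gram of (token, tag) pairs."""
    tags = [tag for _, tag in gram]
    # Heuristic 1: a phrase must not mix hashtags with other tokens.
    n_hash = sum(tag == "#" for tag in tags)
    if n_hash not in (0, len(tags)):
        return False
    # Heuristic 2: skip phrases that carry no regular part-of-speech tag.
    if all(tag in TWITTER_TAGS for tag in tags):
        return False
    return True


def ngrams(tagged, max_n=5):
    """Yield every run of 1..max_n consecutive tagged tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tagged) - n + 1):
            yield tagged[i:i + n]


def key_phrases(tagged, max_n=5):
    """Return the surface strings of all n-grams that pass the heuristics."""
    return [
        " ".join(tok for tok, _ in gram)
        for gram in ngrams(tagged, max_n)
        if is_candidate(gram)
    ]


# Hand-written example input.
tagged_tweet = [("new", "A"), ("album", "N"), ("#music", "#")]
phrases = key_phrases(tagged_tweet)
print(phrases)
```
]]></preformat>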
      <p>Passage Retriever</p>
      <p>We retrieved relevant Wikipedia articles for each tweet via the API provided by
the track. Our methodology was as follows. Each
phrase extracted from a given tweet was submitted as a query to the Indri search
engine, and we obtained three different kinds of files in the following
format:</p>
      <list list-type="bullet">
        <list-item><p>The "docid" files contain the sentences that we retrieved from the API.
The sentences collected from the same document are merged, stored
and then used as input for the summarization component.</p></list-item>
        <list-item><p>The docid and the phrase rank of the corresponding sentences are stored
in the "docid.id" file.</p></list-item>
        <list-item><p>The docid and the resulting scores are stored in the "docid.score" file.
Average scores are calculated for the same document and phrase id. These scores are
submitted as part of our run.</p></list-item>
      </list>
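      <p>The merging and averaging steps described above can be sketched as follows. This is a minimal sketch under assumptions: each retrieved hit is represented as a (docid, phrase id, score, sentence) tuple, the function name merge_by_document is hypothetical, and the score values are invented for illustration rather than real Indri output.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def merge_by_document(hits):
    """Merge sentences per document and average scores per (docid, phrase id).

    hits: iterable of (docid, phrase_id, score, sentence) tuples.
    """
    sentences = defaultdict(list)
    scores = defaultdict(list)
    for docid, phrase_id, score, sentence in hits:
        sentences[docid].append(sentence)
        scores[(docid, phrase_id)].append(score)
    # One reconstructed document per docid, in retrieval order.
    merged = {docid: " ".join(sents) for docid, sents in sentences.items()}
    # One averaged score per (docid, phrase id) pair.
    averaged = {key: sum(vals) / len(vals) for key, vals in scores.items()}
    return merged, averaged


# Invented example hits; the scores are placeholders.
hits = [
    ("d1", 0, -4.0, "First sentence."),
    ("d1", 0, -6.0, "Second sentence."),
    ("d2", 1, -5.0, "Other sentence."),
]
merged, averaged = merge_by_document(hits)
print(merged["d1"])
```
]]></preformat>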
      <p>Summarizer</p>
      <p>
        We use the MEAD toolkit<sup>2</sup> for this component. MEAD is a multi-document
summarization system proposed by Radev et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; it implements a centroid-based
approach and was later enhanced with various features. We configured the
system with various parameter settings, including the position, similarity with the
first sentence, centroid and query-based features, the MEAD cosine-similarity
re-ranker with a threshold value of 0.7, and the enidf IDF database.
      </p>
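      <p>To illustrate the idea behind a cosine-similarity re-ranker with a threshold of 0.7, the sketch below greedily keeps the highest-scoring sentences and skips any sentence that is too similar to one already selected, up to the 500-word limit. It is not MEAD's implementation: it assumes sentences arrive with precomputed scores, uses plain term-frequency vectors without IDF weighting, and the names cosine and rerank are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sentences using raw term frequencies."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rerank(scored_sentences, threshold=0.7, max_words=500):
    """Select sentences by descending score, dropping near-duplicates."""
    selected, words = [], 0
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    for sentence, _ in ranked:
        n = len(sentence.split())
        if words + n > max_words:
            continue  # would exceed the summary length limit
        if any(cosine(sentence, kept) > threshold for kept in selected):
            continue  # too similar to a sentence already in the summary
        selected.append(sentence)
        words += n
    return selected


# Invented example sentences and scores.
scored = [
    ("the cat sat on the mat", 0.9),
    ("the cat sat on a mat", 0.8),
    ("dogs bark loudly", 0.5),
]
summary = rerank(scored)
print(summary)
```
]]></preformat>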
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>The output summaries were evaluated according to their informativeness and
readability. Table 1 and Table 2 compare the performance of our submitted run
with the best one at INEX 2013 in terms of informativeness and readability,
respectively.</p>
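      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>RunID</th><th>Rank</th><th>Unigram</th><th>Bigram</th><th>Skip Bigram</th></tr>
          </thead>
          <tbody>
            <tr><td>266</td><td>22</td><td>0.9059</td><td>0.9824</td><td>0.9835</td></tr>
            <tr><td>256</td><td>1</td><td>0.8861</td><td>0.881</td><td>0.782</td></tr>
          </tbody>
        </table>
      </table-wrap>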
      <p>We observed that the phrases extracted from tweets contain
unexpected noise that needs to be cleaned. A heuristics-based approach relies
heavily on a small set of scrutinized tweets, and it is difficult to generalize to
arbitrary tweet domains. This can affect the retriever component, where
irrelevant sentences are retrieved as a result of noisy phrases. Another
observation is that creating the documents by merging retrieved sentences and treating
them as input for the MEAD toolkit can make these documents less readable. One
key point in MEAD summarization is the assumption of relatedness between
sentences in one document, from which it builds a graph of inter-references.
This does not really fit the reconstruction of tweets as conducted in
the first two steps of the pipeline. Nevertheless, this observation calls for future
approaches in text summarization in which sentences are less coupled and thus
should be modeled more independently.</p>
      <sec id="sec-3-1">
        <title>2 We use the latest version</title>
        <p>MEAD 3.12, published at http://www.summarization.com/mead/</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The pipeline system was developed as part of our participation in the Tweet
Contextualization track of INEX 2013. The system was evaluated using the
evaluation metrics provided by the track committee, with reasonable results for an
initial implementation.</p>
      <p>Future work will focus on improving the performance of the
system by enhancing the quality of the phrases extracted from tweets and by considering semantic
similarity when retrieving relevant documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Rodrigo, Á., Pérez-Iglesias, J.,
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garrido</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Araujo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A question answering system based on information retrieval and validation</article-title>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Schiffman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McKeown</surname>
            ,
            <given-names>K.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grishman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Question answering using integrated information retrieval and information extraction</article-title>
          .
          <source>In: Proceedings of HLT/NAACL</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bhaskar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>B.S.:</surname>
          </string-name>
          <article-title>A hybrid tweet contextualization system using IR and summarization</article-title>
          .
          <source>In: Proceedings of INEX 2012</source>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Morchid</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Linares</surname>
          </string-name>
          , G.:
          <article-title>A semantic space for tweets contextualization</article-title>
          .
          <source>In: Proceedings of INEX 2012</source>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Torres-Moreno</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velazquez-Morales</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Two statistical summarizers at INEX 2012</article-title>
          .
          <source>In: Proceedings of INEX 2012</source>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ganguly</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leveling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.J.F.</given-names>
          </string-name>
          :
          <article-title>Exploring sentence retrieval for tweet contextualization</article-title>
          .
          <source>In: Proceedings of INEX 2012</source>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Radev</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allison</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blair-Goldensohn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blitzer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celebi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drabek</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hakim</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lam</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Otterbacher</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saggion</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teufel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Topper</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winkel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>MEAD - A platform for multidocument multilingual text summarization</article-title>
          .
          <source>In: Conference on Language Resources and Evaluation (LREC)</source>
          , Lisbon, Portugal (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mills</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heilman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yogatama</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flanigan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.A.</given-names>
          </string-name>
          :
          <article-title>Part-of-speech tagging for Twitter: annotation, features, and experiments</article-title>
          . In:
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2</source>
          . HLT '11, Stroudsburg, PA, USA, Association for Computational Linguistics (
          <year>2011</year>
          )
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>