A Pipeline Tweet Contextualization System at
                    INEX 2013

           Khaled Hossain Ansary, Anh Tuan Tran, Nam Khanh Tran

     Leibniz Universität Hannover / Forschungszentrum L3S, Hannover, Germany
                    ansary@L3S.de, ttran@L3S.de, ntran@L3S.de


        Abstract. This article describes a pipeline system and its preliminary
        results for the Tweet Contextualization track at INEX 2013. The system
        consists of three steps: tweet analysis, passage retrieval and
        summarization. For each tweet, key phrases are first extracted using
        the ArkTweet toolkit and several heuristics. They are then submitted as
        queries to the Indri search engine to retrieve relevant passages.
        Finally, a multi-document summarization system (MEAD) generates the
        output document, limited to 500 words. The preliminary results show
        that the approach does not work well: our run was ranked 22nd out of
        24 runs. We discuss our observations on these results and some possible
        further improvements.


1     Introduction
The tweet contextualization task was first launched at INEX in 2011. The task
concerns tweets^1, which are short messages of around 140 characters. The aim
of tweet contextualization is to automatically provide information, in the form
of a readable summary, that explains the tweet. The summary must not exceed
500 words and is extracted from a cleaned dump of the English Wikipedia. The
summaries are evaluated by the INEX organizers, considering both
informativeness and readability.
    The INEX committee collected about 598 tweets in English from Twitter.
Tweets were selected from informative accounts (for example, @CNN,
@TennisTweets, @PeopleMag, @science), in order to avoid purely personal tweets
that could not be contextualized. In this article, we present the experiments
carried out as part of our participation in INEX 2013. We describe a pipeline
system that first extracts phrases from the tweets using the ArkTweet toolkit
and some heuristics, then retrieves relevant documents for these phrases from
Wikipedia, and finally summarizes them with the MEAD toolkit.


2     Related Work
Several studies have addressed this task. While [1, 2] present improvements to
question answering techniques using information retrieval (IR), [3] describes a
hybrid tweet contextualization system using IR and automatic summarization.
They used the Nutch architecture together with TF-IDF based sentence ranking
and sentence extraction techniques for automatic summarization. An approach
based on mapping source documents into a reduced semantic space is proposed
by [4]. They estimated the words of the semantic space via a Latent Dirichlet
Allocation (LDA) algorithm. [5] developed and tested a statistical word stemmer,
which is used by CORTEX to preprocess input texts and generate readable
summaries. [6] describes a sentence retrieval technique that applies three
methodologies: i) language modeling score, ii) relevance modeling score and
iii) topical relevance modeling score.
    Text summarization has been well studied in the artificial intelligence
community, especially in text mining and information retrieval. Among existing
systems, MEAD [7] is a publicly available toolkit for multi-document
summarization, which generates summaries using cluster centroids produced by a
topic detection and tracking system.

1 https://twitter.com/

[Figure 1 shows the following components: Phrase Chunker, Passage Retriever,
Full-text index, Indexer, Position Resolver, Score Resolver, Document
Reconstructor, Multi-doc Summarizer, Output Converter.]

                     Fig. 1: Overview of the pipeline workflow


3     A Pipeline Tweet Contextualization System
The system pipeline is shown in Figure 1. It consists of three components:
Phrase Chunker, Passage Retriever and Summarizer.

3.1   Phrase Chunker
As shown in the system workflow, the first step is to retrieve passages from
Wikipedia articles given a tweet of interest. As in the traditional retrieval
approach, we initially used the words present in the tweet to retrieve the
relevant passages from Wikipedia. However, we observed unacceptably low
performance when using the original words to query the indices. This is
attributed to the highly noisy nature of tweet content, where key phrases are
often mixed with non-content tokens such as emoticons and overused punctuation.
In addition, users employ several ad-hoc formats that are rarely found elsewhere
when posting tweets. They can use hashtags (a single word starting with ’#’) to
provide implicit context for the tweet, or use the at (@) symbol to tag other
Twitter accounts in the content. In many cases, words are intentionally
modified, for example by repeating vowels to express emotions (e.g. ’so
coooooooool this show was !! :=)’). Such writing styles lead to many irrelevant
results and propagate the noise to the next step.
    To support passage retrieval, we tuned our phrase chunker to detect and
extract the key phrases that are more informative than the rest of the tweet
content. We used the ArkTweet toolkit [8] to tokenize the tweet content and to
annotate each token with an adjusted part-of-speech tag. Apart from the Penn
TreeBank tagset, ArkTweet introduces a number of tags specific to the Twitter
domain, such as hashtag (#), at-mention (@), discourse marker (∼) to indicate
the continuation of a message across multiple tweets such as retweets, URL (U),
and emoticon (E). Detailed references can be found at
http://www.ark.cs.cmu.edu/TweetNLP/annot_guidelines.pdf.
    After tokenizing the tweet, we employed several heuristics to detect the key
phrases as overlapping sequences of consecutive tokens. For example, we required
that a key phrase not be a mix of hashtags and other words, and we skipped
phrases that contain no Penn TreeBank tags. The chunker iteratively generates
all n-grams, where n varies from 1 to 5. Each n-gram is checked against each of
the heuristics. We applied a dynamic programming approach to make sure that
heuristics are not checked again on subsumed n-grams.
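
    The n-gram enumeration and the two example heuristics can be sketched as
follows. This is a minimal illustration and not the implementation used for the
run: the (token, tag) input format and the names TWITTER_TAGS, is_candidate and
extract_key_phrases are hypothetical, every tag outside the Twitter-specific
set is assumed to be a Penn TreeBank-style tag, hashtag-only phrases are
assumed to be kept, and the dynamic programming step is omitted.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag); hypothetical input format

# Twitter-specific tags listed in the text. Any other tag is treated as a
# Penn TreeBank-style tag, which is an assumption of this sketch.
TWITTER_TAGS = {"#", "@", "~", "U", "E"}


def is_candidate(ngram: List[Token]) -> bool:
    """Apply the two example heuristics to one n-gram."""
    tags = [tag for _, tag in ngram]
    if "#" in tags:
        # A key phrase cannot mix hashtags with other tokens.
        # Assumption: hashtag-only phrases are kept.
        return all(tag == "#" for tag in tags)
    # Skip phrases that contain no Penn TreeBank-style tag.
    return any(tag not in TWITTER_TAGS for tag in tags)


def extract_key_phrases(tokens: List[Token], max_n: int = 5) -> List[str]:
    """Enumerate all n-grams (1 <= n <= max_n) and keep those that pass."""
    phrases = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngram = tokens[start:start + n]
            if is_candidate(ngram):
                phrases.append(" ".join(word for word, _ in ngram))
    return phrases


tagged = [("so", "RB"), ("cool", "JJ"), ("#inex", "#"), (":)", "E")]
print(extract_key_phrases(tagged))
```

In this toy input, the emoticon alone and every n-gram that mixes the hashtag
with other tokens are discarded, while the remaining unigrams and the bigram
"so cool" are kept as candidate queries.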


3.2   Passage Retriever

We retrieved relevant Wikipedia articles for each tweet via the API provided by
the track. Our methodology can be described as follows. Each phrase extracted
from a given tweet was submitted as a query to the Indri search engine, and we
stored the results in three different files in the following format:

 – The “docid” files contain the sentences retrieved from the API. Sentences
   collected from the same document are merged, stored and then used as input
   for the summarization component.
 – The docid and the phrase rank of the corresponding sentences are stored
   in the “docid.id” file.
 – The docid and the resulting scores are stored in the “docid.score” file. The
   scores are averaged for the same document and phrase id. These scores are
   submitted as part of our run.
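
The merging and averaging described in the list above can be sketched as
follows. This is a minimal illustration under stated assumptions: the Hit
record, its field names and the function group_hits are hypothetical, the
retrieval results are assumed to be already available in memory, and file
writing is left out.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    """One retrieved sentence (hypothetical record layout)."""
    docid: str
    phrase_id: int
    sentence: str
    score: float


def group_hits(
    hits: Iterable[Hit],
) -> Tuple[Dict[str, str], Dict[Tuple[str, int], float]]:
    """Merge sentences per document and average scores per (docid, phrase id)."""
    sentences = defaultdict(list)
    scores = defaultdict(list)
    for hit in hits:
        sentences[hit.docid].append(hit.sentence)
        scores[(hit.docid, hit.phrase_id)].append(hit.score)
    merged = {docid: " ".join(parts) for docid, parts in sentences.items()}
    averaged = {key: mean(values) for key, values in scores.items()}
    return merged, averaged


hits = [
    Hit("d1", 0, "First sentence.", -2.0),
    Hit("d1", 0, "Second sentence.", -4.0),
    Hit("d2", 1, "Another sentence.", -1.0),
]
merged, averaged = group_hits(hits)
print(merged)
print(averaged)
```

Here `merged` plays the role of the “docid” files and `averaged` that of the
“docid.score” file; the example scores are made up for illustration.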

3.3    Summarizer
We use the MEAD toolkit^2 for this component. MEAD is a multi-document
summarization system proposed by Radev et al. [7], which implements a
centroid-based approach and was later enhanced with various features. We
configured the system with several parameter settings, including the position,
similarity-with-first-sentence, centroid and query-based features, the
MEAD-cosine similarity re-ranker with a threshold value of 0.7, and the enidf
IDF database.
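
The role of a cosine-similarity re-ranker with a threshold can be illustrated
with the sketch below. It is a simplified stand-in and not MEAD's code: the
functions cosine and rerank are hypothetical, sentences are compared with plain
term-frequency vectors instead of IDF-weighted ones, and the sentence scores
are assumed to come from an earlier feature-combination step.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(
    scored: List[Tuple[str, float]],
    threshold: float = 0.7,
    word_limit: int = 500,
) -> List[str]:
    """Greedily pick high-scoring sentences, skipping near-duplicates."""
    chosen: List[str] = []
    vectors: List[Counter] = []
    words_used = 0
    for sentence, _ in sorted(scored, key=lambda pair: pair[1], reverse=True):
        tokens = sentence.lower().split()
        vector = Counter(tokens)
        # Skip a sentence that is too similar to one already selected.
        if any(cosine(vector, other) > threshold for other in vectors):
            continue
        # Respect the summary length limit (500 words in this task).
        if words_used + len(tokens) > word_limit:
            continue
        chosen.append(sentence)
        vectors.append(vector)
        words_used += len(tokens)
    return chosen


scored = [
    ("the cat sat on the mat", 0.9),
    ("the cat sat on the mat today", 0.8),
    ("dogs bark loudly", 0.5),
]
print(rerank(scored))
```

With these made-up scores, the second sentence is dropped because it is nearly
identical to the first, which is the redundancy control that the threshold is
meant to provide.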


4     Results
The output summaries were evaluated according to their informativeness and
readability. Table 1 and Table 2 compare the performance of our submitted run
with the best one at INEX 2013 in terms of informativeness and readability,
respectively.


                    RunID Rank Unigram Bigram Skip Bigram
                     266   22   0.9059 0.9824    0.9835
                     256   1    0.8861 0.881      0.782

Table 1: Comparison of our submitted run (266) and the best run in terms of
informativeness score at INEX 2013


    RunID Relevancy(T) Non redundancy(R) Soundness(A) Syntax(S) Mean Average
     266     25.92%          25.08%         25.92%     25.92%     25.64%
     275     76.64%          67.30%         74.52%     75.50%     72.44%

Table 2: Comparison of our submitted run (266) and the best run in terms of
readability score at INEX 2013


    We observed that the phrases extracted from tweets contain some unexpected
noise, which needs to be cleaned. A heuristics-based approach relies heavily on
the small set of tweets that were scrutinized, and it is difficult to
generalize to arbitrary tweet domains. This can affect the retriever component,
where irrelevant sentences are retrieved as a result of noisy phrases. Another
observation is that creating the documents by merging retrieved sentences and
treating them as input for the MEAD toolkit can make these documents less
readable. One key point in MEAD summarization is the assumption of relatedness
between sentences in one document, from which a graph of inter-references is
built. This does not really fit the documents reconstructed for each tweet in
the first two steps of the pipeline. Nevertheless, this observation calls for
future approaches in text summarization, where sentences are less coupled and
thus should be modeled less dependently.

2 We use the latest version, MEAD 3.12, published at
  http://www.summarization.com/mead/


5    Conclusion

The pipeline system was developed as part of our participation in the Tweet
Contextualization track of INEX 2013. The system was evaluated with the metrics
provided by the track organizers; its initial implementation yields preliminary
results that leave room for improvement.
    Future work will focus on improving the performance of the system by
enhancing the quality of the phrases extracted from tweets and by considering
semantic similarity when retrieving relevant documents.


References
1. Rodrigo, Á., Pérez-Iglesias, J., Peñas, A., Garrido, G., Araujo, L.: A question
   answering system based on information retrieval and validation (2010)
2. Schiffman, B., McKeown, K.R., Grishman, R.: Question answering using
   integrated information retrieval and information extraction. In: Proceedings
   of HLT/NAACL. (2007)
3. Bhaskar, P., B.S.: A hybrid tweet contextualization system using IR and
   summarization. In: Proceedings of INEX 2012. (2012)
4. Morchid, M., Linares, G.: A semantic space for tweets contextualization. In:
   Proceedings of INEX 2012. (2012)
5. Torres-Moreno, J.M., Velazquez-Morales, P.: Two statistical summarizers at
   INEX 2012. In: Proceedings of INEX 2012. (2012)
6. Ganguly, D., J.L., Jones, G.J.F.: Exploring sentence retrieval for tweet
   contextualization. In: Proceedings of INEX 2012. (2012)
7. Radev, D., Allison, T., Blair-Goldensohn, S., Blitzer, J., Çelebi, A., Dimitrov, S.,
   Drabek, E., Hakim, A., Lam, W., Liu, D., Otterbacher, J., Qi, H., Saggion, H.,
   Teufel, S., Topper, M., Winkel, A., Zhang, Z.: MEAD — A platform for multidocu-
   ment multilingual text summarization. In: Conference on Language Resources and
   Evaluation (LREC), Lisbon, Portugal (2004)
8. Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman,
   M., Yogatama, D., Flanigan, J., Smith, N.A.: Part-of-speech tagging for Twitter:
   annotation, features, and experiments. In: Proceedings of the 49th Annual
   Meeting of the Association for Computational Linguistics: Human Language
   Technologies: short papers - Volume 2. HLT ’11, Stroudsburg, PA, USA,
   Association for Computational Linguistics (2011) 42–47