=Paper= {{Paper |id=Vol-1178/CLEF2012wn-INEX-BandyopadhyayEt2012 |storemode=property |title=Passage Retrieval for Tweet Contextualization at INEX 2012 |pdfUrl=https://ceur-ws.org/Vol-1178/CLEF2012wn-INEX-BandyopadhyayEt2012.pdf |volume=Vol-1178 |dblpUrl=https://dblp.org/rec/conf/clef/BandyopadhyayPMMG12 }} ==Passage Retrieval for Tweet Contextualization at INEX 2012 == https://ceur-ws.org/Vol-1178/CLEF2012wn-INEX-BandyopadhyayEt2012.pdf
    Passage retrieval for tweet contextualization at
                       INEX 2012

Ayan Bandyopadhyay1 , Sukomal Pal2 , Mandar Mitra1 , Prasenjit Majumder3 ,
                       and Kripabandhu Ghosh1
              1 Indian Statistical Institute (Kolkata)
              2 ISM Dhanbad
              3 DAIICT Gandhinagar



        Abstract. This paper describes some preliminary results obtained by
        treating the tweet contextualization task as a passage retrieval task. Each
        tweet was submitted as a query to the Indri 5.2 search engine after some
        preprocessing. Either paragraphs or sentences were retrieved in response
        to a query. Passages retrieved from the same document were concate-
        nated. This approach does not work very well in terms of informativeness:
        the best of our runs was ranked 23rd out of 33 runs. Further exploration
        of ways to improve effectiveness is needed.


1     Introduction

The INEX tweet contextualization task at CLEF 2012 is a new task. The aim of
this task is to provide some context for a given topic tweet (i.e. a message
posted on http://twitter.com). For this task, the
context consists of a passage of at most 500 words extracted from a cleaned dump
of the English Wikipedia. It is intended to provide some background information
that will help a user to better understand the tweet.
    In this report, we describe our very preliminary attempts at tweet contextu-
alization. To begin with, we have simply treated contextualization as a passage
retrieval task. After some preprocessing, the textual content of a tweet is used
as a query to retrieve paragraphs or sentences from the Wikipedia corpus. If
multiple passages are retrieved from the same article, they are merged together.
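    To make this pipeline concrete, the Python sketch below shows the kind of
preprocessing we have in mind. It is an illustration only: the stopword list,
the regular expressions and the function name are our own choices, not the
exact code behind the submitted runs, and Porter stemming (which the actual
runs apply, see Section 3) is omitted to keep the sketch dependency-free.

import re

# Illustrative stopword list only; the submitted runs use a standard list.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on",
             "rt", "the", "to"}

def tweet_to_query(tweet_text):
    """Turn the raw text of a tweet into a bag-of-words query.

    URLs, @-mentions (the tweeting authority), the token "RT" and stopwords
    are removed.  Porter stemming (e.g. NLTK's PorterStemmer) would normally
    be applied to the surviving terms as well.
    """
    text = re.sub(r"https?://\S+", " ", tweet_text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)                 # drop @-mentions
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # crude tokenization
    return " ".join(t for t in tokens if t not in STOPWORDS)

# Example:
#   tweet_to_query("RT @user: Higgs boson results announced http://t.co/xyz")
#   -> "higgs boson results announced"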
    Related work is discussed in the next section (Section 2). Our approach is
described in Section 3. Section 4 presents our results and discusses some obvious
limitations of our approach. Our plans for further experimentation are outlined
in Section 5.


2     Related Work

The tweet contextualization task was introduced by INEX at CLEF 2012; Bellot
et al. [1] give an overall report on INEX 2011. In this task, tweets are
treated as topics. Twitter (http://twitter.com) is one of the most popular
microblogging sites; Efron [2] gives an overview of microblogs and of the
retrieval problems that surround them, e.g. microblog retrieval, entity
search and sentiment analysis. From a passage retrieval point of view,
Robertson et al. [4] explain why a simple linear combination should not be
used to merge the scores of passages retrieved from the same document (a toy
illustration is given at the end of this section). After passage retrieval,
the next step is answer construction, in which summarization and the framing
of the answer play an important role. Salton et al. [5] describe automatic
text summarization using intra-document passage links, and the recent survey
by Nenkova and McKeown [3] gives an elaborate account of text summarization
techniques.
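    The toy example below uses made-up scores to show how adding raw passage
scores lets a document contributing several mediocre passages outrank one
containing a single highly relevant passage. Ranking by the best passage is
shown only as a contrast, not as the solution proposed in [4].

# Made-up passage scores for two documents.
scores = {"A": [0.375, 0.375, 0.25],   # three mediocre passages
          "B": [0.75]}                 # one highly relevant passage

summed = {d: sum(s) for d, s in scores.items()}  # naive linear combination
best   = {d: max(s) for d, s in scores.items()}  # rank by best passage only

print(summed)  # {'A': 1.0, 'B': 0.75}    -> A outranks B
print(best)    # {'A': 0.375, 'B': 0.75}  -> B outranks A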


3   Experimental Setup
We divided each page in the corpus into separate paragraphs using the <p> and </p>
tags. All text contained between these tags was indexed. Each paragraph was
also split further into sentences, using periods (.), question marks (?) and
exclamation marks (!) as sentence delimiters. Stopwords were removed, and
Porter's stemmer was used. Since any period (.) was regarded as an
end-of-sentence marker, abbreviations were also split up when the text was
indexed at the sentence level. This is why the number of terms (total and
distinct) is somewhat different when the same text is indexed at the two
levels of granularity. Some statistics about the processed corpus are given
in Table 1.

Table 1. Comparison of paragraph- and sentence-level indexing: corpus statistics

                                    Paragraph level   Sentence level
    Number of paragraphs/sentences       8,388,955       26,039,270
    Unique terms                         2,878,685        2,876,680
    Total terms                        333,522,647      333,697,767

    The topic tweets (1142 in all) were provided in two formats: JSON and
simple text. We used the simple text format. Stopwords, URLs, the name of the
tweeting authority, and the text “RT” were removed. The remaining words were
stemmed using Porter's stemmer. Using these preprocessed tweets as queries,
and Indri 5.2 as the search engine, we retrieved in turn paragraphs and
sentences for each query tweet. A total of three runs were submitted. Details
about these runs are given below.

Run1 — The top 50 retrieved paragraphs were submitted. If multiple paragraphs
    were retrieved from the same document, those paragraphs were concatenated.
    The similarity scores of the individual paragraphs were simply added
    together to obtain the score of the concatenated result. Any paragraph
    longer than 500 words (including those obtained by concatenation) was
    truncated to the first 500 words.
Run2 — Same as Run1, except that we started with the top 100 sentences for
    each query.
Run3 — Same as Run1, except that the top 100 paragraphs were used.


4     Results

Submitted summaries were evaluated according to their informativeness and
readability. Table 2 compares the performance of our submitted runs (Run1,
Run2, Run3) with the best run at INEX 2012; the informativeness measures are
dissimilarity scores, so lower values are better.

Table 2. Comparison of submitted runs and the best run at INEX 2012

    Run Name   Run ID   Rank (out of 33)   Unigram   Bigram   Skip Bigram
    Run1          149                 26    0.9059   0.9916        0.9916
    Run2          150                 23    0.9052   0.9871        0.9868
    Run3          151                 33    0.9223   0.9985        0.9988
    Best          178                  1    0.7734   0.8616        0.8623

It is clear that the overly simplistic approaches we tried did not perform
well with regard to informativeness (our runs did obtain good readability
scores, however). Of our runs, the sentence-level run performs best. A number
of obvious drawbacks need to be rectified.

 – When multiple paragraphs / sentences from a single document are
   concatenated, their similarity scores are simply added together. This may
   lead to poor ranking [4]. The score of the combined passage needs to be
   calculated more carefully.
 – We need to be more careful when splitting a paragraph into sentences. In
   particular, periods used with acronyms and abbreviations should not result
   in sentence breaks.
 – Retrieved passages are arbitrarily truncated at 500 words, without
   checking for sentence boundaries.


5     Conclusion

As mentioned in Section 2, a number of query-oriented summarization
approaches have been proposed in earlier work. In future work, we intend to
explore how these may be applied to the contextualization task. Also, given
that the “topics” or tweets are short to start with (at most 140 characters,
many of which are taken up by URLs), query expansion is likely to be
beneficial; a rough sketch of one possibility is given below.
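    The sketch shows a naive form of pseudo-relevance feedback. The function,
its parameters and the unweighted selection of expansion terms are our own
illustrative simplifications and were not part of the submitted runs.

import re
from collections import Counter

def expand_query(query, feedback_passages, n_terms=10, stopwords=frozenset()):
    """Naive pseudo-relevance feedback.

    Terms are counted over the top-ranked ("feedback") passages returned for
    the original query, and the most frequent terms that are neither stopwords
    nor already in the query are appended to it.  A real implementation would
    weight the expansion terms (e.g. Rocchio or a relevance model) rather than
    treat them all equally.
    """
    original = set(query.split())
    counts = Counter()
    for passage in feedback_passages:
        for term in re.findall(r"[a-z0-9]+", passage.lower()):
            if term not in original and term not in stopwords:
                counts[term] += 1
    expansion = [term for term, _ in counts.most_common(n_terms)]
    return query + " " + " ".join(expansion)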
We also hope to investigate further query expansion / reformulation
techniques as ways to improve the informativeness of the generated summaries.


References

1. Patrice Bellot, Timothy Chappell, Antoine Doucet, Shlomo Geva, Jaap Kamps,
   Gabriella Kazai, Marijn Koolen, Monica Landoni, Maarten Marx, Véronique
   Moriceau, Josiane Mothe, G. Ramírez, Mark Sanderson, Eric SanJuan, Falk
   Scholer, Xavier Tannier, Martin Theobald, M. Trappett, Andrew Trotman, and
   Q. Wang. Report on INEX 2011. SIGIR Forum, 46(1):33–42, 2012.
2. Miles Efron. Information search and retrieval in microblogs. JASIST,
   62(6):996–1008, 2011.
3. Ani Nenkova and Kathleen McKeown. Automatic summarization. Foundations and
   Trends in Information Retrieval, 5(2-3):103–233, 2011.
4. Stephen Robertson, Hugo Zaragoza, and Michael Taylor. Simple BM25 extension
   to multiple weighted fields. In Proc. CIKM, pages 42–49. ACM, 2004.
5. Gerard Salton, Amit Singhal, Mandar Mitra, and Chris Buckley. Automatic
   text structuring and summarization. Inf. Process. Manage., 33(2):193–207,
   1997.