Mining Diverse Views from Related Articles

Ravali Pochampally, Center for Data Engineering, IIIT Hyderabad, Hyderabad, India. ravali@research.iiit.ac.in
Kamalakar Karlapalem, Center for Data Engineering, IIIT Hyderabad, Hyderabad, India. kamal@iiit.ac.in

ABSTRACT
The world wide web allows for diverse articles to be available on a news event, product or any other topic. It is easy to find a few hundred articles that discuss a specific topic, which makes it difficult for a user to quickly process the information. Summarization condenses a huge volume of information related to a topic but does not provide a delineation of the issues pertaining to it. We want to extract the diverse issues pertaining to a topic by mining views from a collection of articles related to it. A view is a set of sentences, related in content, that address an issue relevant to a topic. We present a framework for the extraction and ranking of views and have conducted experiments to evaluate the framework.

Categories and Subject Descriptors
H.5 [Information Systems]: Information Interfaces and Presentation

General Terms
Human Factors, Experimentation

Keywords
text mining, views, diversity, information retrieval

1. INTRODUCTION
The world wide web is a storehouse of information. Users who want to comprehend the content of a particular topic (e.g. FIFA 2010) are often overwhelmed by the volume of text available on the web. Websites which organize information based on content (google news^1) and/or user ratings (amazon^2, imdb^3) also output several pages of text in response to a query. It is difficult for an end-user to process all the text presented.

^1 http://news.google.com/
^2 http://www.amazon.com/
^3 http://www.imdb.com/

Multi-Document Summarization [2] is a prominent Information Retrieval (IR) technique to deal with this problem of information overload. But summaries typically lack the semantic grouping needed to present the multiple views addressed by a group of articles. Providing diverse views and allowing users to browse through them will facilitate the goal of information exploration by giving the user a definite and detailed snapshot of their topic of interest.

Articles which pertain to a common topic (e.g., swine flu in India) are termed 'related'. By isolating views, we aim to organize content in a more detailed manner than summarization does. We define a view as:

    A sentence or a set of sentences which broadly relate to an issue addressed by a collection of related articles and aid in elaborating the different aspects of that issue.

1.1 Motivating Example
Here is a pair of views obtained by our framework. Both views are mined from Dataset 1. The number in curly brackets indicates the ID of the article from which the sentence is extracted. A description of the datasets is given in Table 1.

Example Views

1. The irresponsibility of the financial elite and US administrations has led the US economy to the brink of collapse. {18} On Friday, the Dow was down a mere 0.3% on the week - but to get there, the Fed and the Treasury had to pump hundreds of billions into the global financial system. {14} The collapse of the oldest investment bank in the country could strongly undermine the whole US financial system and increase the credit crisis. {3} After a week that saw the collapse of Lehman Brothers, the bailout of the insurer AIG and the fire sale of Merrill Lynch and the British bank HBOS, policy makers hit back, orchestrating a huge plan to sustain credit markets and banning short sales of stock. {48} It was a dramatic reversal from the first half of the week, when credit markets virtually seized up and stocks around the globe plunged amid mounting fears for the health of the financial system. {18}

2. The Swiss National Bank is to pump USD 27 billion into markets and the Bank of Japan (BOJ) valued its part in the currency swap with the Federal Reserve at 60 billion. {35} The Bank of Canada was also involved, and The Bank of England said it would flood 40 billion into the markets. {26} And, despite the agreements that Barclays Capital and Bank of America will sign with executives at Lehman Brothers or Merrill Lynch, it is the hunting season in the banking world for the crème de la crème. {14}
The first view details the breakdown of the US economy along with a few signs of damage control. The second view reports the actions of various banks during the financial turmoil in 2008. These views capture a glimpse of the specific issues pertaining to the topic of 'financial meltdown'. A list of such diverse views would organize the content of a collection of related articles and provide a perspective into that collection.

The problem statement is:

    Given a corpus of related articles A, identify the set V of views pertaining to A, rank V and detect the most relevant view (MRV) along with the set of outlier views (OV).

Table 1: Datasets
ID | Source          | Search Term         | # Articles
1  | google news     | financial meltdown  | 49
2  | google news     | swine flu india     | 100
3  | google news     | israel attacks gaza | 24
4  | amazon.com      | the lost symbol     | 25
5  | tripadvisor.com | hotel taj krishna   | 20
6  | tripadvisor.com | hotel marriott      | 16
7  | google news     | fifa vuvuzela       | 39
8  | google news     | gulf oil spill      | 26

1.2 Related Work
Powell and French [1] [8] proposed that providing multiple view-points of a document collection, and allowing users to move among these view-points, facilitates the location of useful documents. Representations, processes and frameworks required for developing multiple view-points were put forth.

Tombros et al. [10] proposed the clustering of Top-Ranking Sentences (TRS) for efficient information access. Clustering and summarization were combined in a novel way to generate a personalized information space. Clusters of TRS were generated by a hierarchical clustering algorithm using the group-average-link method. It was argued that TRS clustering provides better information access than routine document clustering.

TextTiling [5] is a technique for subdividing text into multi-paragraph units that represent passages or subtopics. It makes use of patterns of lexical co-occurrence and distribution. The algorithm has three parts: tokenization into sentence-sized units, determination of a score for each unit, and detection of sub-topic boundaries. Sub-topic boundaries are assumed to occur at the largest valleys in the graph that results from plotting sentence-units against scores.

1.2.1 Views vs. Summary
The summary and the views generated for Dataset 5 are available at https://sites.google.com/site/diverseviews/comparison. The summary is generated by the update summarization 'baseline algorithm' [6]. It is conspicuous by its lack of organization. Though successful in covering the salient features of the review dataset, it groups several conflicting sentences together (observe the last two sentences of the summary). The views generated by our framework present an organized representation by generating clusters of semantically related sentences. As is evident, the first view discusses the positive attributes of hotel taj krishna in hyderabad while the second view is negative in tone. The third and fourth views discuss specific aspects of the hotel such as the food and the facilities available. Presenting multiple views for a topic allows us to model the diversity in its content. Our representation is concise, as the average number of sentences per view was found to be 3.9. In our framework, we address two drawbacks of summarization - lack of organization and verbosity (due to user-specified parameters).

1.3 Contributions
The main contributions of this work are:

1. Defining the concept of a view over a corpus of related articles
2. Presenting a framework for mining diverse views
3. Ranking the views based on a quality parameter (cohesion) defined by us
4. Presenting results to validate the framework

1.4 Organization
In section 2, we elaborate on the framework for the extraction of views. MRV, OV and the ranking mechanism are explained in detail in section 2.5. Section 3 presents the experimental evaluation and discussion. In section 4, we sum up our contributions and outline future work.

2. EXTRACTION OF VIEWS
In this section, we detail the steps involved in the extraction of views and define a quality parameter for ranking the views according to their relevance. Figure 1 presents an overview of the framework by depicting the steps involved in the algorithm. Input and output are specified for each step of the algorithm.

[Figure 1: Framework. The set of related articles (A) passes through data cleaning & preprocessing, extraction of top-ranking sentences (top n), a clustering engine that produces views, and ranking by the quality parameter (cohesion), yielding the ranked views, the MRV and the OV.]
2.1 Datasets
Articles which make relevant points about a common topic but score low on pairwise cosine similarity can be included in our datasets, because we aim to present multiple views from a set of related articles rather than group them based on overall content similarity. We used data from news aggregator and review web sites, as they group articles discussing a common topic in spite of the low semantic similarity between them. We crawled articles published within a range of dates when the activity pertaining to a relevant topic peaked. For example, we crawled articles published on 'gulf oil spill' between 15 April 2010 and 15 July 2010, when the news activity pertaining to that topic was at its maximum. We crawled websites which provided rss feeds or had a static html format that could be parsed. Table 1 provides the description of the datasets. For instance, Dataset 1 is collected from google news using the search term 'financial meltdown' and contains 49 articles. The datasets can be found at https://sites.google.com/site/diverseviews/datasets.
                                                                 obtained by multiplying its weighted term frequency ti,j and
2.2 Data Cleaning and Preprocessing
Web data was collected using Jobo^4, a java crawler. The data was given as input to the data cleaning and preprocessing stage. Data cleaning is important as it parses the html data and removes duplicates from the articles. We define a 'duplicate' as an article having the exact syntactic terms and sequences, with or without formatting differences. Hence, by our definition, duplicates have a cosine similarity value of one.

Text data devoid of html tags is given as input to the data preprocessing stage, where stemming and stopword removal are performed. Stemming is the process of reducing inflected (or derived) words to their stem or root form (example: running to run, parks to park, etc.). In most cases, these morphological variants of words have similar semantic interpretations and can be considered equivalent for the purpose of IR applications. Stopwords are the highly frequent words in the english language (example: a, an, the, etc.). Owing to their high frequency, and their usage as conjunctions and prepositions, they do not add any significant meaning to the content. Hence, their removal is essential to strip superfluous content and retain the essence of the article. In order to capture the user notion, the review datasets were not checked for typographical and grammatical errors and were retained verbatim. The Python modules HTMLParser^5 and nltk.wordnet^6 were used to parse the html data and to perform stemming, respectively. IR metrics such as word frequency and TF-IDF^7 were extracted for later analysis.

^4 http://java-source.net/open-source/crawlers/jobo
^5 http://docs.python.org/library/htmlparser.html
^6 http://www.opendocs.net/nltk/0.9.5/api/nltk.wordnet-module.html
^7 http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
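To make this step concrete, the minimal Python sketch below performs the stemming and stopword removal described above. It uses NLTK's English stopword list and PorterStemmer as stand-ins for the ranks.NL stopword list and the nltk.wordnet module used in our implementation, so it illustrates the step rather than reproducing our exact pipeline.

# A minimal preprocessing sketch. NLTK's English stopword list and
# PorterStemmer are stand-ins (assumptions) for the ranks.NL list and
# the nltk.wordnet stemmer used in the actual implementation.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase, tokenize, drop stopwords and punctuation, stem the rest."""
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]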
2.3 Extraction of Top-Ranking Sentences
A dataset consisting of many articles, with content spanning various issues, needs a pruning mechanism to extract the sentences from which the views can be generated. We prune a dataset by scoring each sentence in it and extracting the top-ranked ones. A list of notations used in our discussion is given in Table 2.

Table 2: Notations
T_{i,j}           : tf-idf_{i,j}
tf-idf_{i,j}      : TF-IDF of term t_i in article d_j
tf-idf_{i,j}      : tf_{i,j} * idf_i
tf_{i,j}          : n_{i,j} / \sum_k n_{k,j}
n_{i,j}           : number of occurrences of t_i in article d_j
\sum_k n_{k,j}    : sum of occurrences of all terms t_k in article d_j
idf_i             : log(|D| / |{d : t_i \in d}|)
|D|               : total number of articles in the corpus
|{d : t_i \in d}| : number of articles which contain the term t_i

Let <S_1, S_2, S_3, ..., S_n> be the set of sentences in an article collection. tf-idf_{i,j}, the TF-IDF of a term t_i in article d_j, is obtained by multiplying its weighted term frequency tf_{i,j} and its inverse document frequency idf_i. A high value of tf-idf_{i,j} (T_{i,j}) is attained by a term t_i which has a high frequency in a given article d_j and a low occurrence rate across the articles in the collection; the appearance of such words in an article is more indicative of the issues it addresses. T_{i,j} is thus a re-weighting of word importance: though it increases proportionally with the number of times a word appears in an article, it is offset by the frequency of the word in the corpus. We consider the product of the T_{i,j} of the constituent words of a sentence to be a good indicator of its significance. A product can be biased by the number of words in a sentence; hence, we normalize it by dividing by the length of the sentence. Given the notation above, we define the importance I_k of a sentence S_k, belonging to an article d_j and having r constituent words, as

    I_k = \frac{\prod_{i=1}^{r} T_{i,j}}{r}

where T_{i,j} is the tf-idf of word w_i \in S_k \wedge d_j, i.e., I_k is the product of the tf-idf of all the w_i, normalized by the sentence (S_k) length r.

Logarithm normalization was not used, as the σ value of r was 2.2 and the variance in its value was not exponential. Sentences are arranged in non-increasing order of their importance (I) scores, and we choose the top n sentences for our analysis. Experiments are conducted to correlate the range of n with the corresponding score obtained by our ranking parameter.
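For concreteness, the sketch below computes T_{i,j} and I_k directly from the formulas in Table 2, assuming the articles have already been tokenized and stemmed as in section 2.2; the function names (tfidf_table, importance) are illustrative, not from any library.

# Sketch of the importance score I_k of section 2.3, computed from scratch
# following Table 2. Assumes articles are lists of preprocessed tokens.
import math
from collections import Counter

def tfidf_table(articles):
    """articles: list of token lists. Returns, per article, {term: T_ij}."""
    D = len(articles)                       # |D|
    df = Counter()                          # |{d : t_i in d}|
    for tokens in articles:
        df.update(set(tokens))
    tables = []
    for tokens in articles:
        counts = Counter(tokens)            # n_ij
        total = sum(counts.values())        # sum_k n_kj
        tables.append({t: (n / total) * math.log(D / df[t])
                       for t, n in counts.items()})
    return tables

def importance(sentence_tokens, T_j):
    """I_k: product of T_ij over the r words of S_k, divided by r."""
    r = len(sentence_tokens)
    if r == 0:
        return 0.0
    prod = 1.0
    for w in sentence_tokens:
        prod *= T_j.get(w, 0.0)
    return prod / r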
2.4 Mining Diverse Views
A measure of similarity between two sentences is required to extract semantically related views from them. Semantic similarity calculates the correlation score between sentences based on the likeness of their meaning. Mihalcea et al. [7] proposed that the specificity of a word can be determined using its inverse document frequency (idf). Using a metric for word-to-word similarity and specificity, they define the semantic similarity of two text sentences S_i and S_j, where w represents a word in a sentence, as

    sim(S_i, S_j) = \frac{1}{2} \left( \frac{\sum_{w \in S_i} maxSim(w, S_j) \cdot idf(w)}{\sum_{w \in S_i} idf(w)} + \frac{\sum_{w \in S_j} maxSim(w, S_i) \cdot idf(w)}{\sum_{w \in S_j} idf(w)} \right)

This metric is used for our analysis as it combines the semantic similarities of each text segment with respect to the other. For each word w in the segment S_i, we identify the word in segment S_j that has the highest semantic similarity, i.e. maxSim(w, S_j), according to some pre-defined word-to-word similarity measure. Next, the same process is applied to determine the most similar word in S_j with respect to the words in S_i. The word similarities are then weighted by the corresponding word specificities, summed up and normalized by the length of each sentence.
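A sketch of this computation is given below, with wup (discussed next) as the word-to-word measure via NLTK's WordNet interface. The idf values are assumed to come from the corpus statistics of section 2.3, and the helper names (wup, directed, sim) are illustrative rather than our exact implementation.

# Sketch of the sentence similarity of Mihalcea et al. [7], with the
# Wu-Palmer measure [4] as maxSim's word-to-word score. idf is a dict of
# word specificities; the default of 1.0 for unseen words is an assumption.
from nltk.corpus import wordnet as wn

def wup(w1, w2):
    """Best Wu-Palmer score over the synsets of two words (0 if none)."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            score = s1.wup_similarity(s2)
            if score is not None and score > best:
                best = score
    return best

def directed(src, dst, idf):
    """Sum over w in src of maxSim(w, dst) * idf(w), normalized by idf mass."""
    num = sum(max((wup(w, v) for v in dst), default=0.0) * idf.get(w, 1.0)
              for w in src)
    den = sum(idf.get(w, 1.0) for w in src)
    return num / den if den else 0.0

def sim(Si, Sj, idf):
    """Symmetric combination of the two directed scores, as in the formula."""
    return 0.5 * (directed(Si, Sj, idf) + directed(Sj, Si, idf))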
                                                                    2.6 Framework for Extracting Views
Wordnet-based similarity measures score well in recognizing semantic relatedness [7]. Pucher [9] carried out a performance evaluation of the wordnet-based semantic similarity measures and found that wup [4] is one of the top performers in capturing semantic relatedness. We also chose wup because it is based on the path length between the synsets of words and its performance is consistent across the various parts-of-speech (POS). We used the Python module nltk.corpus^8 to implement wup. Pairwise semantic similarity sim(S_i, S_j), or s_{i,j}, is a symmetric relation; thus, we used only the upper triangle of the similarity matrix (X) to reduce computational overhead:

    \forall s_{i,j} \in X : s_{i,j} = s_{j,i}

We used clustering to proceed from a set of sentences to views containing similar content. The similarity matrix (X) was given as input to Python scipy-cluster^9, which performs Hierarchical Agglomerative Clustering (HAC). HAC was used because we can terminate the clustering when the values of the scoring parameter converge, without explicitly specifying the number of clusters to output.

^8 http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus-module.html
^9 http://code.google.com/p/scipy-cluster
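For illustration, the sketch below clusters sentences from a similarity matrix with SciPy's hierarchical clustering (scipy.cluster.hierarchy, the present-day home of the scipy-cluster code). The average linkage and the fixed distance threshold are simplifying assumptions; our implementation instead terminates when the scoring parameter converges.

# Sketch of the clustering step: group sentences by HAC over the pairwise
# similarities s_ij, converted to distances. Linkage method and threshold
# are assumptions, not the paper's convergence-based stopping rule.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_views(sim_matrix, threshold=0.5):
    """sim_matrix: symmetric n x n array of s_ij in [0, 1].
    Returns views as lists of sentence indices (singletons become OV)."""
    dist = 1.0 - np.asarray(sim_matrix, dtype=float)
    np.fill_diagonal(dist, 0.0)                   # required by squareform
    Z = linkage(squareform(dist, checks=False), method='average')
    labels = fcluster(Z, t=threshold, criterion='distance')
    views = {}
    for idx, label in enumerate(labels):
        views.setdefault(label, []).append(idx)
    return list(views.values())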
Each cluster comprises sentences grouped according to the similarity measure (s_{i,j}) discussed above; hence, it is logical to treat the clusters as views discussing a specific issue. In the next section, we propose a quality parameter for the ranking and evaluation of views.

2.5 Ranking of Views
Our qualitative parameter for ranking the views focuses on the average pairwise similarity between the constituent sentences of a view V in order to define its cohesion (C). We define cohesion as

    C = \frac{\sum_{i,j \in V} s_{i,j}}{len(V)}

where s_{i,j} = sim(T_i, T_j), V is the set of sentences (T_i) comprising the view, and len(V) is the number of sentences in the view.

As per our definition, the higher the value of cohesion, the greater the content similarity between the sentences of a view. Our framework is designed to ascribe importance to views with maximum pairwise semantic similarity. Thus, we defined the Most Relevant View (MRV) as the view with the maximum value of cohesion, i.e., maximum content overlap amongst its constituent sentences. Outlier views (OV) are the views containing a single sentence. They are termed outliers because their semantic similarity with the other sentences is too low to allow any meaningful grouping. We rank all the views in non-increasing order of their cohesion. As their corresponding pairwise similarity is zero, outlier views have a cohesion value of zero; hence, we order the outlier views according to their importance (I) scores.
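The ranking step can be sketched as follows; cohesion follows the definition above, while the function names and tie-breaking details are illustrative assumptions.

# Sketch of the ranking step: cohesion C, MRV (maximum cohesion), and OV
# (singleton views, cohesion 0, ordered by their importance scores).
from itertools import combinations

def cohesion(view, sim_matrix):
    """C = (sum of pairwise s_ij within the view) / len(view)."""
    pair_sum = sum(sim_matrix[i][j] for i, j in combinations(view, 2))
    return pair_sum / len(view)

def rank_views(views, sim_matrix, importance_scores):
    """Returns (ranked multi-sentence views, MRV, outlier views)."""
    scored = [(cohesion(v, sim_matrix), v) for v in views]
    ranked = sorted((cv for cv in scored if len(cv[1]) > 1),
                    key=lambda cv: cv[0], reverse=True)
    mrv = ranked[0][1] if ranked else None        # maximum cohesion
    # Outlier views: singletons, ordered by the importance (I) of their sentence.
    ov = sorted((v for c, v in scored if len(v) == 1),
                key=lambda v: importance_scores[v[0]], reverse=True)
    return ranked, mrv, ov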

2.6 Framework for Extracting Views
Algorithm 1 provides the steps involved in mining diverse views from a set of related articles. The articles are cleaned by parsing the html and removing duplicates. IR metrics such as TF-IDF are collected before calculating the importance (I) of each sentence. The sentences are ranked in non-increasing order of their importance to pick the top n sentences. We calculate the pairwise semantic similarity between the chosen sentences to cluster them; clustering is used to generate semantically related views from a set of disparate sentences. Finally, we rank the views according to the quality parameter proposed by us.

Algorithm 1 Mining Diverse Views
Require: Related Articles A
Ensure: Ranked Views V with MRV and OV
 1: for all a in A do
 2:   aClean ← ParseHTML(a)
 3:   if aClean is not duplicate then
 4:     ACLEAN ← ACLEAN + aClean
 5:   else
 6:     discard aClean
 7:   end if
 8: end for
 9: for all a in ACLEAN do
10:   a ← removeStopwords(a)                       // ranks.NL stopwords
11:   ASTEM ← ASTEM + stem(a)                      // nltk stemmer
12: end for
13: for all a in ASTEM do
14:   for all word in a do
15:     computeTFIDF(word)
16:   end for
17: end for
18: for all sentence in ASTEM do
19:   rankedSentences ← calculateImportance(sentence)   // section 2.3
20: end for
21: topN ← pickTopSentences(rankedSentences, n)    // as per importance (I)
22: for all sentence s1 in topN do
23:   for all sentence s2 in topN do
24:     if (s1, s2) not in simMatrix then
25:       simMatrix ← simMatrix + calculateSimilarity(s1, s2)
26:     end if
27:   end for
28: end for
29: rawViews ← clusteringEngine(simMatrix)         // scipy-cluster
30: for all view in rawViews do
31:   views ← views + calculateCohesion(view)      // section 2.5
32: end for
33: rankedViews ← rankByCohesion(views)
34: MRV ← chooseMaxCohesion(rankedViews)
35: OV ← chooseZeroCohesion(rankedViews)

3. EXPERIMENTAL EVALUATION
Extraction of Top-Ranking Sentences requires the number of constituent sentences (n) as an input. The ideal range of values for this input parameter is the one which maximizes the cohesion of the views, and determining it is a critical part of our framework. Hence, we analysed the result data to find the relevant range for n.

An input parameter producing views where the median cohesion is greater than (or equal to) the mean is preferred. As the mean is influenced by the outliers in a dataset, the median being at least as high as the mean indicates consistency across the values of cohesion. If all the values of mean cohesion are greater than those of the median, the input parameter yielding views with the maximum mean cohesion is preferred.

We collected statistics about the cohesion (mean, median), the number of views, the number of outliers etc. for values of n equal to 20, 25, 30, 35, 40, and 50. The results are presented in Table 5. ID indicates the dataset-ID (as per Table 1), TRS stands for the number of Top-Ranking Sentences, and V and O stand for the number of views and outliers respectively.

Table 5: Results
ID | TRS | Mean (C) | Median (C) | V | O
1  | 20  | 17.58    | 17.60      | 3 | 10
1  | 25  | 32.52    | 36.87      | 3 | 13
1  | 30  | 11.23    | 11.23      | 2 | 15
1  | 35  | 10.78    | 10.78      | 2 | 18
1  | 40  | 12.50    | 16.74      | 3 | 20
1  | 50  |  4.43    |  4.69      | 6 | 25
2  | 20  | 18.86    | 15.63      | 4 | 10
2  | 25  | 15.60    | 15.60      | 4 | 13
2  | 30  | 10.98    |  4.97      | 5 | 15
2  | 35  | 14.16    |  5.12      | 7 | 18
2  | 40  | 10.03    |  5.12      | 7 | 20
2  | 50  | 11.42    |  4.79      | 7 | 25
3  | 20  | 13.32    |  4.52      | 3 | 12
3  | 25  | 17.34    | 14.82      | 4 | 15
3  | 30  | 19.38    | 21.56      | 4 | 18
3  | 35  | 18.53    | 15.07      | 5 | 21
3  | 40  |  7.11    |  5.10      | 4 | 24
3  | 50  | 11.44    |  4.75      | 5 | 30
4  | 20  | 23.32    | 24.04      | 4 | 10
4  | 25  | 16.55    | 16.30      | 6 | 13
4  | 30  | 20.37    | 16.86      | 5 | 15
4  | 35  |  7.18    |  5.07      | 5 | 18
4  | 40  | 13.13    | 11.25      | 6 | 20
4  | 50  |  8.46    |  4.54      | 6 | 25
5  | 20  | 10.61    | 10.61      | 2 | 10
5  | 25  |  7.34    |  5.58      | 3 | 13
5  | 30  |  5.48    |  5.58      | 3 | 15
5  | 35  | 17.25    |  5.58      | 7 | 18
5  | 40  | 12.02    |  6.59      | 4 | 20
5  | 50  | 10.64    |  5.11      | 6 | 25
6  | 20  | 10.51    |  5.18      | 3 | 10
6  | 25  | 19.83    | 15.23      | 5 | 13
6  | 30  | 14.94    | 10.21      | 6 | 15
6  | 35  | 14.55    | 10.37      | 6 | 18
6  | 40  | 14.00    | 10.33      | 8 | 20
6  | 50  |  7.87    |  4.47      | 7 | 25
7  | 20  | 11.52    |  5.09      | 4 | 10
7  | 25  | 13.55    |  4.72      | 4 | 14
7  | 30  | 11.90    |  4.73      | 5 | 16
7  | 35  | 14.74    |  4.73      | 5 | 20
7  | 40  |  8.35    |  4.59      | 6 | 26
7  | 50  |  7.00    |  4.50      | 8 | 32
8  | 20  | 10.52    | 10.72      | 4 | 10
8  | 25  | 13.95    | 10.55      | 4 | 14
8  | 30  | 14.54    | 16.30      | 5 | 16
8  | 35  | 12.94    | 10.61      | 6 | 19
8  | 40  | 10.92    |  4.58      | 5 | 27
8  | 50  | 11.81    |  4.72      | 6 | 34

Figures 2 to 9 plot the variation in the mean and median cohesion in relation to the number of TRS (n). The value of n is plotted on the horizontal axis and the value of cohesion on the vertical axis. We can deduce from the graphs that the mean and median cohesion peak for 20 ≤ n ≤ 35. The exact breakup of the value of n yielding the best cohesion for each dataset is provided in Table 3.

Table 3: Breakup of n
Dataset             : Number of TRS
financial meltdown  : 25
swine flu india     : 25
israel attacks gaza : 30
the lost symbol     : 20
hotel taj krishna   : 20
hotel marriott      : 25
fifa vuvuzela       : 35
gulf oil spill      : 30

As evident from our results, choosing more top-ranking sentences need not necessarily lead to views with better cohesion. To extract views with the best cohesion, one can start with a lower bound (e.g. 20) of top-ranking sentences and incrementally add x sentences until one reaches an upper bound (e.g. 35). Incremental clustering [3] can be used to obtain the views at each step, and the cohesion values can then be compared to present the set of views which yields the best cohesion; a sketch of this selection loop follows.
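One possible reading of this selection procedure is sketched below. mine_views(n) is a placeholder for the pipeline of Algorithm 1 run with n top-ranking sentences; for simplicity, the snippet re-runs the clustering at every step instead of clustering incrementally, and the preference rule is our reading of the median-versus-mean criterion above.

# Sketch of the n-selection heuristic of section 3. mine_views(n) is a
# hypothetical callable returning the ranked (cohesion, view) pairs for n
# top-ranking sentences; the key encodes "prefer median >= mean, then mean".
from statistics import mean, median

def best_views(mine_views, lower=20, upper=35, step=5):
    """Sweep n over [lower, upper]; prefer an n whose median cohesion is at
    least the mean (consistency), breaking ties on the mean itself."""
    best_result, best_key = None, None
    for n in range(lower, upper + 1, step):
        ranked = mine_views(n)                # list of (cohesion, view) pairs
        cohesions = [c for c, _ in ranked]
        if not cohesions:
            continue
        key = (median(cohesions) >= mean(cohesions), mean(cohesions))
        if best_key is None or key > best_key:
            best_result, best_key = (n, ranked), key
    return best_result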

Below we present three views mined by our framework. The value of n for each view is the one which yields the best cohesion for that dataset (as presented in Table 3).

Example 1 | fifa vuvuzela (7) | n: 35 | cohesion: 40.71 (MRV)
The true origin of the vuvuzela is disputed, but Mkhondo and others say the tradition dates back centuries - "to our forefathers" - and involves the kudu. {5} The plastic trumpets, which can produce noise levels in excess of 140 decibels, have become the defining symbol of the 2010 World Cup. {12} For this reason, there is no doubt that the vuvuzela will become one of the legacies that Africa will hand over to the world after the world cup tournament, since the Europeans, Americans and Asians could not resist the temptation of using it and are seen holding it to watch their matches. {3} Have you ever found yourself in bed in a dark room with just a single mosquito for company? The buzzing sound of the vibrations made by the mosquito's wings. {10} On the other hand, its ban will affect the mood of the host nation and, of course, other African countries at the world cup, because of the deep rooted emotions attached to it by fans. {3} This has sparked another controversy in the course of the tournament and has become the single item for discussion in the media since the LOC made that controversial statement on Sunday evening. {3}

Example 2 | swine flu india (2) | n: 25 | cohesion: 4.52 (Rank 4)
Patnaik, who created the image with the help of his students, on the golden beach has depicted the pig wearing a mask with the message 'Beware of swine flu'. The sculpture was put on display late Thursday on the beach in Puri, 56 km from the state capital Bhubaneswar. {18} Of the six cases reported in Pune, three are students who contracted the virus in the school. {91}

Example 3 | the lost symbol (4) | n: 20 | cohesion: 40.02 (MRV)
I read the book as fast as I could. Of course as a Dan Brown classic, it was very interesting, exciting and made me wanting to read as fast as I could. {13} Every symbol, every ritual, every society, all of it, even the corridors and tunnels below Washington, DC, it's all real. {3} I feel more connected to the message of this book (the reach and the power of the human mind) than I did to possibility that Jesus had a child. {12} Malakh is after secret knowledge guarded by the Masons and he'll stop at nothing to get it. To that end he's kidnapped Peter Solomon, the head of the Masonic order in Washington, DC. {1} Malakh is about to reach the highest level in the Masonic order, but even so, he knows he will not be admitted to the most secret secrets. Secrets that he's sworn to protect. He is not what he seems to his fellow Masons. He's lied to them. He has his own agenda. {2} [sic]

The first and third examples were ranked first (MRV) by our framework and the second one was ranked fourth.
If we examine the first example, a user who does not know the term 'vuvuzela' can immediately glean that it is a plastic trumpet which caused quite a stir in the fifa world cup 2010. There are also some sentences which insinuate a likely ban and the surrounding controversy. In an ideal scenario, we would like to group the sentences about the ban and the controversy into another view; but as it stands now, our view describes the instrument and the impact of the vuvuzela on the world cup, and serves as a good introduction for a novice or as a concise issue capsule for a user who is already familiar with the topic.

Similarly, the second example, which was ranked fourth by our framework, talks about the repercussions of the disease swine flu on pune and puri (cities in India). The third example, ranked first, contains some positive opinions about the book 'The Lost Symbol' and also a sneak peek into the intentions of the character Malakh. Additional example views are provided in the appendix.

The average number of sentences across all the views was found to be 3.9 and the average number of views across all the datasets was found to be 4.88. Table 4 presents the breakup for each dataset. Mean (S) indicates the average number of sentences across all the views, and Mean (N) indicates the average number of views. The implementation of the framework as described in Algorithm 1 took an upper bound of 4.2 seconds to run, with computeTFIDF and calculateImportance being the time-consuming steps at 2.6 seconds.

Table 4: Mean values
Dataset             | Mean (S) | Mean (N)
financial meltdown  | 3.91     | 3.17
swine flu india     | 4.26     | 5.67
israel attacks gaza | 3.74     | 4.17
the lost symbol     | 4.16     | 5.33
hotel taj krishna   | 3.67     | 4.17
hotel marriott      | 3.82     | 5.83
fifa vuvuzela       | 3.56     | 5.33
gulf oil spill      | 4.21     | 5.00

The main difference between summarization and our framework is that we provide multiple diverse views, whereas summarization lacks such an organization. We also rank these views, thereby allowing a user to look at just the Most Relevant View (MRV) or the top x views as per their convenience. As we provide the IDs of the source articles in each view, a user can also browse through them to learn more about that view.

4. CONCLUSION
Users who want to browse the content of a topic on the world wide web (www) have to wade through the diverse articles available on it. Though summarization is successful in condensing a huge volume of information, it groups several issues pertaining to a topic together and lacks an organized representation of those underlying issues. In this paper, we propose a framework to mine the multiple views addressed by a collection of articles. These views are easily navigable and provide the user a detailed snapshot of their topic of interest. Our framework extends the concept of clustering to the sentence or phrase level (as opposed to document clustering) and groups semantically related sentences together to organize content in a way that is different from text summarization.

In the future, we want to determine the polarity of a view (positive/negative/neutral) by examining the adjectives in it. We also want to incorporate user feedback, by means of clicks and time spent on a page (implicit) and ratings or numerical scores (explicit), to evaluate the performance of our framework and, if possible, re-rank the views.

5. REFERENCES
[1] A. L. Powell and J. C. French. Using multiple views of a document collection in information exploration. In CHI '98: Information Exploration Workshop, 1998.
[2] R. K. Ando, B. K. Boguraev, R. J. Byrd, and M. S. Neff. Multi-document summarization by visualizing topical content. In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization - Volume 4, NAACL-ANLP-AutoSum '00, pages 79-98, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.
[3] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, STOC '97, pages 626-635, New York, NY, USA, 1997. ACM.
[4] Z. Wu and M. Palmer. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, 1994.
[5] M. A. Hearst. TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23:33-64, March 1997.
[6] R. Katragadda, P. Pingali, and V. Varma. Sentence position revisited: a robust light-weight update summarization 'baseline' algorithm. In CLIAWS3 '09: Proceedings of the Third International Workshop on Cross Lingual Information Access, pages 46-52, Morristown, NJ, USA, 2009. Association for Computational Linguistics.
[7] R. Mihalcea and C. Corley. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI '06, pages 775-780, 2006.
[8] A. L. Powell and J. C. French. The potential to improve retrieval effectiveness with multiple viewpoints. Technical report, VA, USA, 1998.
[9] M. Pucher. Performance evaluation of wordnet-based semantic relatedness measures for word prediction in conversational speech. In IWCS-6: Sixth International Workshop on Computational Semantics, Tilburg, Netherlands, 2005.
[10] A. Tombros, J. M. Jose, and I. Ruthven. Clustering top-ranking sentences for information access. In Proceedings of the 7th ECDL Conference, pages 523-528, 2003.

APPENDIX

Example 4 | gulf oil spill (8) | n: 35 | cohesion: 16.64 (Rank 2)
BP and the Coast Guard are also using chemicals to disperse the oil, which for the most part is spread in a thin sheen. But the area of the sheen has expanded to more than 150 miles long and about 30 miles wide. {1} The Coast Guard confirmed that the leading edge of the oil slick in the Gulf of Mexico is three miles from Pass-A-Loutre Wildlife Management Area, the Reuters news agency reported. The area is at the mouth of the Mississippi River. {1} "They're going to be focusing on the root cause, how the oil and gas were able to enter the [well] that should've been secured," he said. "That will be the primary focus, how the influx got in to the [well]." {1}

Example 5 | hotel marriott (6) | n: 30 | cohesion: 15.23 (Rank 3)
Well located hotel offering good view of the lake. The rooms are clean and comfortable and have all amenities and facilities of a 5 star hotel. The hotel is not overtly luxurious but meets all expectations of a business traveller. The Indian restaurant is uper and a must-try. {6} The food is excellent and like I said, if it were not for the smell and so-so servie, I would stay here. {14} The rooms are great. Well lit, loaded with amenities and the trademark big glass windows to look out.. The bathroom is trendy and looks fabulous with rain shower and a bathtub. {12} [sic]

Example 6 | swine flu india (2) | n: 25 | cohesion: 15.98 (Rank 3)
Three new cases of swine flu were confirmed in the city on Sunday, taking the total number of those infected to 12 in the State. {5} "Currently, it isn't the flu season in India, but if the cases keep coming in even after the rains, it will clash with our flu season (post-monsoon and winter period) which could be a problem", he said. {55} In Delhi, out of the four cases, three people, including two children aged 12, contracted the virus from a person who had the flu. {12}

Example 7 | financial meltdown (1) | n: 35 | cohesion: 0 (Outlier View)
It has to be said: The model of the credit rating agencies has collapsed. Whether because of their unprofessionalism or inherent conflicts of interest, the fact that the agencies receive their pay from the companies they cover has bankrupted the system. {11}

Example 8 | israel attacks gaza (3) | n: 40 | cohesion: 0 (Outlier View)
"I heard the explosions when I was standing in the hall for protection. Suddenly, in a few seconds, all of the police and firemen were in the building," said resident Rachel Mor, 25. {21}
[Figures 2-9: for each dataset, the mean and median cohesion (vertical axis) are plotted against the number of TRS, n (horizontal axis), for n = 20 to 50. Figure 2: financial meltdown. Figure 3: swine flu india. Figure 4: israel attacks gaza. Figure 5: the lost symbol. Figure 6: hotel taj krishna. Figure 7: hotel marriott. Figure 8: fifa vuvuzela. Figure 9: gulf oil spill.]